Most conventional speech processing systems have been
based on the vocal tract characteristics with little, if
any, consideration given to the glottal source
characteristics. However, it is known that the glottal
source characteristics also play a major role. In this study
the acoustic variability related to the phonetic
characteristics of the glottal source was explored.

In  speech  recognition  systems,  features  based  on  a 
linear  prediction  analysis  have  been  widely  used.  Since 
linear  prediction  analysis  cannot,  theoretically,  resolve 
the  source  and  the  tract  characteristics,  the  recognition 
features possess redundant information. Thus, it is
generally accepted that some normalization technique should
be incorporated to reduce the redundancy in the model of the
data. To provide an analytic basis for normalization, a
formal  model  based  on  analysis-by-synthesis  was  implemented 
and  tested. 

In  speech  synthesis,  a fixed  glottal  pulse  shape  is 
often  adopted  as  the  source  for  all  voiced  sounds.  Speech 
synthesized  with  this  approach  can  be  highly  intelligible, 
but  does  not  sound  natural  with  different  voice  types.  It  is 
believed  that  different  glottal  pulse  shapes  should  be  used 
for  synthesizing  natural  sounding  speech  with  different 
voice  types.  To  incorporate  the  capability  of  synthesizing 
different  voice  types  into  the  speech  production  model,  a 
variable  glottal  pulse  model  was  studied. 

The  procedure  of  glottal  inverse  filtering  was 
automated  to  obtain  the  glottal  volume-velocity  waveform  and 
the closed-phase vocal tract filter. A least-squares error
criterion was used to obtain the best fit between the
glottal volume-velocity waveform and the waveform model
parameters. To find typical values and ranges of the
variable glottal source model parameters for different voice
types, statistical analyses were performed. Speech tokens
synthesized  from  the  variable  glottal  pulse  model  were  then 
used  in  informal  listening  tests  to  verify  the  statistical 
results  for  different  voice  types.  The  techniques  developed 
in  this  study  can  be  easily  adopted  to  extract  voice  type 
information  from  speech  signals  or  to  enhance  the 
performance  of  speech  processing  systems. 

CHAPTER 1
INTRODUCTION 

With  the  rapid  development  of  digital  techniques  in 
the  areas  of  signal  processing  and  semiconductors,  speech 
processing  systems  have  achieved  major  strides  forward  and 
have  been  used  widely  in  recent  decades.  It  is  easy  to  find 
applications  for  speech  processing  systems  such  as  the 
vocoder,  voice  mail  services,  weather/time  information 
services,  data  entry  and  information  retrieval  systems, 
voice  activated  commercial  goods,  talking  toys,  speaker 
identification/verification  systems,  dictating  systems, 
text-to-speech  systems,  voice  translation  systems,  etc. 
Though  some  speech  processing  systems  have  been  implemented 
and  used  successfully  in  very  limited  fields,  most  of  them 
are  not  robust  enough  to  be  used  in  all  applications.  Thus 
much  more  work  remains  to  be  done,  especially  in  the  areas 
of  speech  analysis  and  feature  extraction,  in  order  to 
improve  and  expand  the  capabilities  of  speech  processing 
systems.

1.1  Speech  Variability  and  Voice  Type 

Speech  is  the  conversion  of  language  into  sound. 
Speech  signals  are  composed  of  a sequence  of  distinct  sounds 
governed  by  the  rules  of  language.  Speech  signals  generated 
by humans convey the content of the message as well as the
identity  and  emotional  state  of  a speaker,  etc.  At  the 
acoustic  level  speech  signals  consist  of  rapid  fluctuations 
of  air  pressure.  The  acoustic  manifestations  of  a speech 
sound  can  vary  substantially  across  different  speakers  or 
even  within  the  same  speaker.  All  variations,  however,  when 
heard  by  a skilled  listener,  may  signify  the  same  linguistic 
content.  It  seems  that  humans  may  use  invariant  features  and 
cues  in  ordinary  spoken  communications. 

The  term  "variability"  has  been  used  in  different 
contexts.  For  example,  for  the  acoustic  speech  signal, 
variability  is  used  to  represent  the  variations  of  the 
speech  signal  characteristics  that  carry  the  same  linguistic 
meaning  to  the  human  listener.  Variability  may  also  be  used 
to  describe  variations  in  perceptual  processes.  In  speech 
perception,  variability  is  used  to  indicate  the  variations 
of  meaning  that  an  acoustic  manifestation  of  an  utterance 
may  have  in  different  contextual  environments.  Variability 
is  also  used  to  indicate  the  variations  in  the  quality  of 
voice.

While the term "voice quality" has been used in
different contexts, it usually refers to the total
auditory impression the listener experiences upon hearing
the speech of another talker [Childers et al., 1987a]. Here
the term "voice type" is used to indicate the voice quality
variations due to the changes of vocal fold vibratory
patterns. It is used in the limited sense of meaning speech
characteristics related to the voice source only, or in
other  words,  the  laryngeal  aspects  of  speech.  Thus  our  main 
concern  is  with  the  vocal  excitation  characteristics  and  the 
acoustical,  perceptual  correlates  of  different  types  of 
phonation  of  human  voice  production. 

There are six types of phonation: modal,
vocal fry, falsetto, breathy, harsh, and whisper [Laver and
Hanson, 1981; Lee and Childers, 1989; Childers and Lee,
1991]. Although it is difficult to define a "modal (normal)
voice," because there are many normal voices depending on
such speaker characteristics as gender and age [Eskenazi et
al., 1990], researchers generally agree that a particular
voice type is characterized by a certain pattern of vocal
fold vibrations, with the vocal folds approximated in a
similar way throughout a certain pitch range [Boone, 1971;
Hollien, 1974; Childers and Lee, 1991]. The vocal fold
vibratory pattern of a modal phonation is characterized by a
moderate frequency, wide lateral excursions, and complete
closure of the glottis during about one third of the entire
pitch period [Hollien, 1974]. In vocal fry register, the
vocal fold vibratory pattern is characterized by a low
frequency, small lateral excursions, and a long closed
glottal phase [Hollien, 1974]. A breathy voice is
characterized by an incomplete and lax vocal fold
approximation during phonation [Boone, 1971]. A harsh voice
is characterized by aperiodic vocal fold vibrations [Boone,
1971; Moore, 1962; Eskenazi et al., 1990]. Breathy and harsh
voices are considered pathologic voices or vocal disorders,
which may arise from various functional or organic
pathologies.


1.2  Research  Issues 

The  problem  of  speech  variability  is  an  important 
aspect  in  the  design  of  a speech  synthesizer.  In  many 
existing  electrical  synthesizers,  the  properties  of  the 
vocal  fold  source  are  approximated  only  in  a gross  form.  A 
fixed  glottal  wave  shape  whose  amplitude  spectrum  falls  at 
about -12 dB/octave is used. Thus, the spectrum envelope is
perfectly regular (i.e., monotonically decreasing at 12 dB
per octave), which contrasts with evidence indicating the
presence of zeros in the spectra of normal voicing waveforms
[Klatt, 1980, 1987]. Such lack of fidelity in duplicating
actual  glottal  characteristics  undoubtedly  detracts  from 
both  speech  naturalness  and  the  ability  to  simulate  a given 
voice.  In  order  to  extract  features  that  are  essential 
for  both  synthesizing  various  voice  qualities  and  for 
converting  one  voice  to  another  [Childers  et  al.,  1987a, 
1987b,  1989;  Klatt  and  Klatt,  1990],  variability  in  the 
speech  signals  should  be  characterized  in  some  way. 
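
As a brief check on the slope quoted above, a fall of about 12 dB per octave corresponds to an amplitude spectrum that decreases roughly as the inverse square of frequency. The symbol G(f) for the source amplitude spectrum is introduced here only for illustration and does not appear in the original text.

```latex
|G(f)| \propto \frac{1}{f^{2}}
\quad\Longrightarrow\quad
20\log_{10}\frac{|G(2f)|}{|G(f)|}
= 20\log_{10}\frac{1}{4}
\approx -12.04\ \text{dB per octave}
```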

In  machine  speech  recognition,  the  problem  of 
variability  in  speech  signals  has  been  a most  challenging 
one  [Klatt,  1986;  Levinson,  1987].  However,  acoustic 
variation  is  apparently  not  a major  problem  to  the  human 
listener.  Listeners  can  cope  with  the  acoustic  variability 
associated with speaker differences even in the presence of
disturbing  noise.  As  Klatt  (1986)  has  pointed  out, 
significant  progress  could  be  achieved  by  studying  and 
characterizing  variability  in  ways  that  will  lead  to  speaker 
normalization  procedures,  the  design  of  improved  distance 
metrics,  and  rules  that  characterize  the  acoustic 
manifestations  of  allophonic  variants  of  phonemic  elements. 

Because the variability in speech signals is so
diverse,  research  efforts  have  been  conducted  that  span 
various  fields  including  linguistics,  psychology, 
physiology,  and  the  physics  of  speech  behavior.  However, 
there  is  no  comprehensive  understanding  of  this  variability 
that  could  lead  researchers  to  a complete  theory  of  speech 
production  and  perception.  One  way  to  study  speaker 
variability  might  be  to  focus  attention  on  voice  quality 
variations,  namely  voice  type  differences  caused  by  changes 
of  the  vocal  fold  vibratory  patterns,  which  show  a great 
deal  of  variability  in  speech  signals. 

To  understand  phonation,  the  physiological,  acoustic, 
and  perceptual  aspects  must  be  examined  [Fant,  1960,  1979, 
1980;  Flanagan,  1972;  Laver,  1980;  Laver  and  Hanson,  1981]. 
From  the  physiological  point  of  view,  the  activities  of  the 
laryngeal  muscles  and  the  vocal  fold  vibration  patterns 
during  phonations  should  be  addressed.  From  the  acoustic 
aspect,  it  is  necessary  to  extract  the  acoustic  correlates 
that characterize different types of phonation. It is also
necessary to determine the perceptual factors that the human
auditory  system  uses  to  discriminate  different  voice  types. 

1.2.1  Research  Objectives 

Variations in both the glottal source and the vocal
tract characteristics are major sources of variability
in acoustic speech signals. In this research we concentrate
largely on studying the acoustic variability in speech
signals of various voice types caused by the changes of
vocal fold vibratory patterns. The objectives of this research
are as follows:

1)  to  develop  a statistical  model  based  on  statistical 
analyses  of  a selected  feature  set  of  the  glottal  source,  in 
order  to  find  essential  features  for  various  voice  types. 

2)  to  develop  rules  and  obtain  knowledge  about  different 
voice  types  in  order  to  synthesize  various  voice  types. 

1.2.2  Applications 

The  techniques  and  results  obtained  from  this  research 
will  be  helpful  in  various  ways: 

1)  to  systematically  study  variability  in  speech  signals  so 
that  we  may  obtain  knowledge  about  speaker  adaptation  and/or 
normalization,  which  in  turn  may  be  beneficial  for  speaker 
independent  speech  recognition, 

2)  to  improve  analysis  methods  of  extracting  features  for 
speaker  modeling, 

3)  to  design  a more  sophisticated  synthesizer  in  order  to 
synthesize or convert/create new voices,

4) to train or tune an automatic speech recognition system
using synthesized speech data without having to include
costly human subject training data,

5)  to  design  improved  distortion  measures. 

1.3  Sources  of  Variability  in  Speech  Signals 

Variability  in  speech  signals  can  be  classified  into 
three categories: intra-speaker variability, inter-speaker
variability, and noise. The same speaker can produce the
same  utterance  differently.  A main  source  of  the 
intra-speaker  variability  could  be  psychophysical  or 
physiological  state  changes  (e.g.,  age,  disease,  and 
emotional  stress)  that  affect  the  organs  related  to  speech 
production  [Hecker,  1971;  Streeter  et  al.,  1983].  For 
inter-speaker  variability,  the  most  notable  differences 
between  speakers  are  in  anatomical  differences  (physical 
dimensions)  of  the  organs  [subglottal  respiratory  system, 
larynx,  oral  and  pharyngeal  cavities,  and  nasal  cavities] 
related  to  speech  production.  Other  sources  of  inter-speaker 
variability  are  learned  and  psychophysical/physiological 
differences.  Potential  sources  of  speaker  variability  in 
speech production are tabulated in Figure 1-1. Noise in
speech signals can be divided into two categories:
speech-related and speech-independent noise. Speech-related
noise is due to a speaker's speaking behavior, such as
aspiration or uttering meaningless sounds during speaking.
Speech-independent noise consists of environmental effects
such as signal monitoring characteristics (microphone,
transmission facilities), additive background noise,
competing speech, etc.

Figure 1-1. Relationships between potential sources of
speaker variability and their acoustic correlates

It  is  difficult  to  clearly  distinguish  between  aspects 
of  speech  production  that  are  anatomical,  those  that  are 
learned,  and  those  that  are  a consequence  of 
psychophysical/physiological  attributes.  Thus,  acoustic 
correlates  of  the  intra-speaker  and  the  inter-speaker 
differences cannot be clearly dichotomized. Moreover, the
same acoustic properties can be affected by both the source
and the tract characteristics due to the source-tract
interaction [Stevens, 1972]. Figure 1-2 shows some acoustic
properties  affected  by  intra-speaker  and/or  inter-speaker 
variability  in  the  source  and  the  tract.  As  can  be  seen  in 
Figure  1-2,  there  are  many  acoustic  parameters  related  to 
source  characteristics,  whereas  the  vocal  tract 
characteristics  mainly  affect  the  formant  frequencies  and 
bandwidths,  which  are  most  critical  in  perceiving  speech 
signals.

It is also important to note that any analysis
technique may introduce errors in some form. Thus care
should be taken to avoid such artifacts, or at least to
keep their effects negligible.

Figure 1-2. Some acoustic properties affected by
intra-speaker and/or inter-speaker variability in the
source and the tract.

1.4 Previous Research

Traditionally  the  problem  of  speaker  variability  has 
been  studied  under  the  area  of  speaker  normalization. 
Speaker  normalization  is  generally  considered  a means  for 
reducing  the  extreme  variability  between  speakers.  All 
attempts  at  speaker  normalization  are  aimed  at  capturing  an 
invariant description of a speaker. Although there is no
unified framework for classifying normalization techniques,
there appears to be a consensus on the following grouping:
speaker normalization, speaking time (or rate)
normalization, and environmental (or noise) normalization
including amplitude/energy compensation [Klatt, 1986; Lea,
1980; Zue and Schwartz, 1980].

Efforts  to  achieve  speaker  normalization  usually 
involve  the  following  strategies:  (1)  extraction  and 
classification  of  features  of  speech  signals  that  are 
invariant  across  speakers,  (2)  derivation  and  application  of 
a transformation  of  speech  signals  so  that  the 
idiosyncrasies  of  the  speaker  are  suppressed,  or  (3)  a 
statistical  characterization  of  variability  in  the  signal 
across large populations [Levinson, 1987]. Due to the lack
of knowledge about speech production, a complete set of
invariants cannot presently be extracted from speech
signals. The strategy which has gained popular acceptance
may be the one that can combine the known invariant cues
with the "ignorance models" [Blumstein, 1986; Cole et al.,
1986; Klatt, 1986]. In practice, we presently depend
greatly on statistical techniques.

Time  normalization  techniques  have  been  used  to  reduce 
variability  caused  by  speaking  rate  differences.  Since 
duration  serves  as  an  important  cue  to  both  segmental  and 
supra-segmental  structures  of  an  utterance,  systems  must  be 
able to (1) measure durations of segments and/or syllables,
and (2) interpret deviations of duration from expectations,
or ignore them when appropriate [Klatt, 1986].

Environmental  or  noise  normalization  tries  to  reduce 
or  remove  any  irrelevant  components,  which  are  not  related 
to  the  contents  of  speech  signals. 

1.4.1  Vowel/Consonant  Normalization 

Most  speaker  normalization  techniques  proposed  in  the 
past  attempt  to  normalize  vowels.  The  vowels  show  more 
dynamic  characteristics  than  the  consonants,  and  exhibit 
greater  variations  in  length  with  changes  in  speaking  rate. 
A vowel normalization technique is usually based on a
modified formant space that is formed from linear
combinations of the estimated formant frequencies from the
speech signals [Kuwabara, 1985; Miller et al., 1980; Paliwal
et al., 1983; Syrdal and Gopal, 1986; Wakita, 1977]. It was
found that by using the modified formant space, the same
vowels in various contexts show tendencies of being grouped
together and generally show less speaker dependency than in
a simple formant space, e.g., (F1, F2) space. In real speech
signals,  however,  it  is  not  easy  to  get  good  estimates  of 
the  formant  frequencies.  Some  methods  that  do  not  require 
the  estimation  of  the  formant  frequencies  have  been 
proposed,  including  frequency  warping  methods  [Blomberg  and 
Elenius,  1986;  Matsumoto  and  Wakita,  1986]  and  methods  that 
apply  the  human  auditory  transformation  to  the  spectrum  of 
the  speech  signals  [Bladon  and  Lindblom,  1981;  Shirai  and 
Kobayashi, 1986; Suomi, 1984]. But these methods suffer from
problems  caused  by  glottal  source  effects  and  algorithm 
sensitivity  to  spectral  peaks. 

At  present,  no  explicit  consonant  normalization  method 
has  been  found,  but  progress  has  been  achieved  in  the  search 
for  speaker-independent  and  context-independent  features 
related  to  the  place  of  articulation  for  the  stop  consonants 
[Blumstein  and  Stevens,  1979;  Kewley-Port,  1983;  Stevens, 
1980; Walley and Carrell, 1983]. These analysis methods,
however, depend largely on interaction with an expert, and
hence are very difficult to automate.

1.4.2  Time  Normalization 

Time  normalization  may  or  may  not  be  necessary, 
depending  on  the  type  of  analysis  model  used  in  the  speech 
processing  system.  If  features  that  are  invariant  across 
speakers  are  available,  then  time  normalization  is  not 
necessary. When a non-parametric model is employed in speech
recognition, regardless of the speech unit used, one must
incorporate a time normalization method such as DTW (Dynamic
Time Warping) techniques [Childers et al., 1989; Levinson,
1985; Rabiner and Levinson, 1981]. Theoretically, DTW
techniques put emphasis on portions of rapid change in the
speech signal, often corresponding to the occurrence of the
consonants. When a parametric model is used, it is not
necessary to normalize the difference of the time duration
of utterances [Levinson, 1985]. An example of a
parametric model is the HMM (Hidden Markov Model) with VQ
(Vector Quantization) [Juang, 1984a].
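
The time alignment performed by DTW can be illustrated with a minimal sketch. The function name `dtw_distance`, the use of scalar features, and the absolute-difference local cost are assumptions made here for illustration; practical recognizers align vectors of spectral features and usually add slope constraints, none of which are taken from the cited works.

```python
from math import inf


def dtw_distance(a, b):
    """Accumulated cost of the best monotonic alignment of a and b.

    Illustrative sketch: a and b are sequences of numbers, and the
    local cost is the absolute difference between aligned samples.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the best accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allowed steps: repeat a sample of b, repeat a sample of a,
            # or advance both sequences together.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]
```

With this sketch, a sequence and a time-stretched copy of it (for example `[1, 2, 3]` and `[1, 2, 2, 3]`) align at zero cost, which is the sense in which the duration difference is normalized away.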


1.4.3  Amplitude/Energy  and  Noise  Normalization 


Amplitude  normalization  of  the  speech  signal  is 
achieved  either  by  analog  automatic  gain  control  circuits  or 
by  digital  means  in  which  the  speech  signal  is  normalized  by 
a factor  determined  from  the  peak  or  average  amplitude  of 
the  signal.  Itakura  (1975)  used  a second-order  inverse 
filter  based  on  the  entire  utterance  to  achieve  amplitude 
normalization.  Noise  normalization  can  be  thought  of  as  a 
part  of  speaker  normalization,  if  the  SNR  (Signal  to  Noise 
Ratio)  is  large  enough.  The  simplest  way  to  normalize  noise 
in  a high  SNR  condition  may  be  a global  energy 
normalization that adjusts the sentence energy contour so
that the peak energy is at or close to 0 dB for each word
[Rabiner and Levinson, 1985]. The global energy
normalization can eliminate the stationary additive
background noise and can adjust the word or sentence energy
contours. In high noise environments or in systems using
telephone  input  it  will  be  difficult  to  detect  and 
distinguish  between  silence,  weak  fricatives,  stops,  and 
nasals.  Therefore,  systems  must  provide  for  some  form  of 
noise adaptation [Reddy, 1976].
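
The global energy normalization mentioned above can be sketched as a shift of the log-energy contour so that its peak lies at 0 dB. The function names, the framing of the signal into lists of samples, and the energy floor are hypothetical choices made for this illustration and are not taken from the cited work.

```python
import math


def frame_energy_db(frame, floor=1e-12):
    """Log energy of one frame in dB; the floor avoids log10(0) on silence."""
    energy = sum(x * x for x in frame)
    return 10.0 * math.log10(max(energy, floor))


def normalize_energy_contour(frames):
    """Shift the energy contour (in dB) so that its peak is exactly 0 dB."""
    contour = [frame_energy_db(frame) for frame in frames]
    peak = max(contour)
    return [value - peak for value in contour]
```

After this shift, every utterance has the same peak level, so only the shape of the contour relative to the peak remains as a feature.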

1.4.4  Automatic  Speech  Recognition 

In  automatic  speech  recognition,  the  purpose  of 
speaker  normalization  is  to  achieve  speaker-independence. 
Approaches  to  speaker  normalization  in  automatic  speech 
recognition  are  quite  different  from  those  for 
vowel/consonant  normalization.  The  major  reason  for  this 
difference  is  that  it  is  difficult  to  identify  boundaries  of 
"phonemes"  from  continuous  speech  signals  to  apply  certain 
vowel/consonant  normalization  techniques.  As  pointed  out  by 
Levinson (1987), in order to achieve speaker-independent
processing techniques, most algorithms use "ignorance
models" in the statistical domain (e.g., clustering
techniques, DTW techniques, or HMM).

There  are  several  important  issues  involved  in 
automatic  speech  recognition,  including: 

(1) Extracting short-time spectral information from
speech signals (modeling);

(2) Efficiently representing temporal spectral
information at any time instant (parameter(s) and/or
distortion measure(s));

(3) Characterizing dynamic spectral information
associated with the time-varying properties of speech
signals;

(4) Using temporal and dynamic spectral features to
measure the similarity (or dissimilarity) between two
given running spectra;

(5) Making template(s) to effectively represent the
majority of speakers;

(6) Incorporating high-level information such as the
structural and linguistic aspects of speech, etc.

To  achieve  speaker  independence,  care  should  be  taken 
in  every  aspect  of  the  above  issues.  Short-time  spectral 
information  of  speech  signals  is  usually  extracted  through 
an  FFT,  an  LPC  spectral  analysis,  or  a filter  bank  with  or 
without  an  auditory  model.  Although  the  LPC  method  has  some 
drawbacks  (e.g.,  non-separation  of  source  and  tract 
characteristics), it has been quite successful in many
automatic speech recognition algorithms or systems [Nocerino
et al., 1985]. Some researchers have investigated auditory
models  with  loudness  normalization,  intensity-level  density 
patterns,  non-Euclidean  distance  measures,  and  spectral-peak 
detection  and  temporal  coding  [Bladon  et  al.,  1984;  Blomberg 
et  al.,  1983;  Dautrich  et  al.,  1983;  Hermansky,  1987],  but 
the  recognition  results  were  not  as  good  as  those  for  the 
LPC-based recognition schemes. For the auditory model, the
major problem is the variation in the glottal source
characteristics.

After  the  analysis  method  is  chosen,  various 
parameters  and  distortion  measures  can  be  attempted.  Among 
them,  the  LPC-based  (generally  weighted)  cepstral  measures 
have  flexibility  for  various  recognition  tasks  and  give 
quite  good  recognition  results  [Nocerino  et  al.,  1985]. 
Once  the  analysis  method  and  the  distortion  measure  are 
chosen,  other  minor  issues  to  be  decided  include  selection 
of  a specific  analysis  method,  the  order  of  analysis,  the 
window size, and the analysis frame rate.
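
An LPC-based cepstral measure of the kind referred to above can be sketched in two steps: converting predictor coefficients to cepstral coefficients, and computing a (optionally weighted) Euclidean distance between two cepstral vectors. The sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, the function names, and the form of the weighting are assumptions made for this illustration; the specific weighting studied by Nocerino et al. is not reproduced here.

```python
import math


def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients c_1..c_n_ceps of the all-pole model 1/A(z).

    Assumes A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p and uses the
    recursion c_n = -a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k},
    with a_n taken as zero for n > p.
    """
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        acc = -a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc -= (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c


def cepstral_distance(c1, c2, weights=None):
    """Weighted Euclidean distance between two cepstral vectors."""
    if weights is None:
        weights = [1.0] * len(c1)  # unweighted by default
    return math.sqrt(sum(w * (x - y) ** 2
                         for w, x, y in zip(weights, c1, c2)))
```

For a single real pole at z = 0.5, that is `a = [-0.5]`, the recursion gives c_n = 0.5**n / n, which agrees with the series expansion of -ln(1 - 0.5 z^-1).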

Spectral  continuity  or  spectral  movement,  especially 
at  the  spectral  peaks,  provides  significant  information  in 
auditory  perception  of  phonemes  [Childers  et  al.,  1989; 
Furui, 1986; Juang, 1984b]. Thus, incorporation of dynamic
(transitional) spectral features helps to achieve speaker
independence. Quite recently the use of dynamic spectral
information, especially in the cepstral domain, was
attempted for speech recognition [Aikawa and Furui, 1988] as
well as for speaker recognition [Soong and Rosenberg, 1988].
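
One common way to obtain dynamic (transitional) features in the cepstral domain is to fit a straight line to each cepstral coefficient over a short window of frames and keep its slope. The sketch below follows that regression idea; the function name, the window length, and the repetition of edge frames are assumptions made for illustration and may differ from the procedures in the cited studies.

```python
def delta_features(frames, half_window=2):
    """Regression slope of each coefficient over +/- half_window frames.

    frames is a list of equal-length coefficient lists, one per frame.
    Frames beyond either end are replaced by the nearest edge frame.
    """
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    denom = 2.0 * sum(k * k for k in range(1, half_window + 1))
    deltas = []
    for t in range(n):
        row = []
        for d in range(dim):
            num = 0.0
            for k in range(1, half_window + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                num += k * (ahead - behind)
            row.append(num / denom)
        deltas.append(row)
    return deltas
```

For a coefficient that rises linearly from frame to frame, the interior delta values equal the slope of that rise, and for a constant coefficient they are zero.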

In  order  to  characterize  speaker  variability 
statistically,  one  may  take  either  a non-parametric  or 
parametric approach [Levinson, 1985]. In the non-parametric
approach, large numbers of prototypes of speech patterns are
collected, usually by clustering, to achieve
speaker-independence. In the parametric approach, such as
hidden Markov modelling, the estimation procedure for the
parameters of the stochastic model of the speech patterns
can  give  reasonable  speaker-independence,  provided  that 
training  data  were  collected  from  a diverse  speaker 
population. 

If  one  wishes  to  design  a robust  speech  understanding 
system,  one  should  also  pay  attention  to  the  "high  level" 
processes of human perception [Zue, 1985]. Artificial
intelligence and/or expert system approaches may greatly
assist the incorporation of knowledge and/or findings from
studies of structural and linguistic aspects of speech
perception.

At  present  there  is  no  "real"  speaker  normalization 
technique, since the variability in speech signals is so
prevalent and not well understood. The most effective
solution to the normalization problem may be the one that
can combine various normalization techniques and optimize
them with respect to the given task. To improve machine
performance, one must understand the processes by
which variability arises and by which listeners have learned
to ignore random acoustic variation, yet attend to the
minute acoustic detail that provides information when
recognizing words from many different speakers. No explicit
estimation of glottal characteristics has been reported.
Thus our approach of studying the variability of glottal
characteristics can serve as a means to understand the
variability in speech signals.

1.5 Description of Chapters

Chapter  2 describes  the  design  of  this  research.  It 
includes  discussions  about:  (1)  the  speech  production  model 
and  speech  analysis/synthesis,  (2)  research  methods, 
research  issues  and  the  data  base  used,  and  (3)  some  major 
factors  for  speaker  variability  and  voice  type  variations. 

Chapter  3 discusses  glottal  inverse  filtering,  which 
is  the  major  tool  used  to  study  variability  in  glottal  flow. 
Also,  some  characteristics  of  glottal  source  waveforms  are 
discussed. 

In  Chapter  4,  voice  source  models  and  their  estimation 
from  speech  signals  are  discussed.  Results  of  glottal 
inverse  filtering  are  presented.  Included  also  are  the 
results  of  statistical  tests  conducted  on  the  glottal  model 
parameters.  Application  of  the  research  results  in  modeling 
speech signals of different voice types is considered.

Finally, Chapter 5 summarizes this research and
concludes  with  recommendations  for  future  work. 


CHAPTER  2 
RESEARCH  DESIGN 


A reliable  approach  to  isolate  effects  of  various 
processes  that  cause  variability  in  speech  signals  may  be 
one  that  can  separate  the  source  and  tract  characteristics. 
In  order  to  study  the  effects  of  the  source  and  tract 
characteristics  separately,  it  would  be  a safe  approach  to 
use  a reliable  "synthesizer"  until  perceptual  models  for 
formant  perception  have  been  developed  further  and  validated 
[Holmes, 1986; Klatt and Klatt, 1990; Seneff, 1982]. This
approach  to  the  study  of  speaker  variability  requires  a 
formal  model  that  can  assist  researchers  in  a unified  way. 
Childers  et  al.  (1987b)  proposed  such  a model  in  order  to 
assist  researchers  1)  to  create  new  synthetic  voices,  2)  to 
study  factors  responsible  for  synthetic  voice  quality,  and 
3)  to  determine  methods  for  speaker  normalization.  They 
showed  the  effectiveness  of  this  model  by  applying  it  to  the 
synthesis  of  various  voice  qualities  and  to  conversion  of 
one voice to another [Childers et al., 1989].

Our  approach  is  based  on  the  model  proposed  by 
Childers  et  al.  (1987b).  The  research  design  for  this  study 
is  shown  in  Figure  2-1.  This  research  is  based  on  the 
analysis-by-synthesis  method.  If  desired,  one  may  add  an 
objective measure process in the evaluation stage to aid
speech  recognition  system  design. 

To  fully  understand  factors  that  cause  variability  in 
acoustic  speech  signals,  we  shall  describe  how  speech  is 
produced.  Speech  analysis  and  synthesis  will  also  be 
examined  to  obtain  knowledge  for  speech  processing  in 
general.

2.1  Speech  Production  Model 

Depending on the mode of excitation, speech sounds
can be divided into three distinct classes, namely: voiced,
unvoiced, and plosive sounds. Voiced sounds are produced by
forcing  air  through  the  glottis.  The  tension  of  the  vocal 
folds  is  adjusted  so  that  the  vocal  folds  vibrate  as  a 
relaxation  oscillation.  This  produces  quasi-periodic  pulses 
of  air  that  excite  the  vocal  tract.  Unvoiced  sounds  are 
generated  by  forming  a constriction  at  some  point  in  the 
vocal  tract,  usually  toward  the  mouth,  and  forcing  air 
through  the  constriction  at  a velocity  high  enough  to 
produce  turbulence.  This  creates  a broad-spectrum  noise 
source  to  excite  the  vocal  tract.  Plosive  sounds  result 
from  making  a complete  closure  of  the  vocal  tract,  building 
up  pressure  behind  the  closure,  and  then  abruptly  releasing 
it.

The  acoustics  of  speech  production  are  based  on  the 
concept  of  a cascaded  source  and  a filter  function.  The 
source  of  voiced  sounds  is  represented  by  a quasiperiodic 
succession  of  pulses  of  air  emitted  through  the  glottis  as 
the  vocal  folds  vibrate.  The  supraglottal  vocal  tract  acts 
as  a filter  that  shapes  the  spectrum  of  the  glottal  flow  to 
produce  different  sounds.  The  phonetic  information  is  mainly 
conveyed  by  the  transfer  function  of  the  filter  representing 
the  vocal  tract  system,  as  proven  by  the  fact  that  voiceless 
articulation  with  an  external  throat  vibrator  can  produce 
intelligible  speech. 

2.1.1  Physical  Model 

The  physical  model  shown  in  Figure  2-2  attempts  to 
simulate  the  actual  physical  processes  involved  in  human 
speech  production.  The  speech  output  is  generated  from  a set 
of  differential  equations  representing  the  states  of  flow 
and  pressure  in  the  combined  subglottal,  glottal,  and 
supraglottal  systems  over  short  time  intervals  [Ishizaka  and 
Flanagan, 1972]. The lung pressure and its associated
muscular  back-up  are  regarded  as  the  source.  A complete 
vocal  tract  network  includes  the  subglottal  system  - the 
lungs,  bronchi,  and  trachea  - and  the  supraglottal  system, 
as  well  as  a finite,  cavity-wall  impedance  and  nasal  system. 

The  main  power  driving  the  mechanical  vibratory  motion 
of  the  vocal  folds  comes  from  the  expiratory  force  as 
represented  by  the  lung  pressure.  Self-oscillation  of  the 
vocal folds is ensured by appropriate feedback of
pressure/flow states. As long as the flow is not reduced by
a supraglottal constriction comparable to that at the
glottis, the glottal area function remains relatively
insensitive to articulatory variations.

Figure 2-2. Physical model of speech production. [From
Flanagan et al., 1975]

The  vocal  system  may  be  represented  as  incremental 
contiguous  sections  of  a lossy  cylindrical  pipe.  For 
frequencies  corresponding  to  wavelengths  that  are  long 
compared  to  the  dimensions  of  the  vocal  tract,  (e.g.,  less 
than  about  4000  Hz),  it  is  reasonable  to  assume  plane  wave 
propagation  along  the  axis  of  the  tube  [Rabiner  and  Schafer, 
1978]. At frequencies for which the dimensions of the vocal
tract  are  comparable  to  the  wavelength,  the  approximation  of 
plane  wave  propagation  is  likely  to  be  in  error  by  a 
considerable  amount.  For  most  of  the  non-nasalized  sounds 
in  normal  speech,  however,  the  error  is  typically  quite 
small.
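
The plane-wave condition can be illustrated with a short calculation. The sketch below compares the acoustic wavelength with the cross dimension of the tract at several frequencies. The speed of sound (35,000 cm/s) and the 3 cm cross dimension are assumed, illustrative values and are not taken from the measurements in this study.

```python
# Assumed values (not from this study): speed of sound in warm, moist air
# and a rough cross dimension of the vocal tract.
SPEED_OF_SOUND_CM_PER_S = 35000.0
CROSS_DIMENSION_CM = 3.0


def wavelength_cm(frequency_hz: float) -> float:
    """Acoustic wavelength in centimetres at the given frequency."""
    return SPEED_OF_SOUND_CM_PER_S / frequency_hz


# Print the wavelength and its ratio to the assumed cross dimension.
for f in (500.0, 1000.0, 4000.0, 10000.0):
    lam = wavelength_cm(f)
    ratio = lam / CROSS_DIMENSION_CM
    print(f"{f:7.0f} Hz: wavelength {lam:6.2f} cm, ratio {ratio:5.2f}")
```

Under these assumptions the wavelength at 4000 Hz is still several times the cross dimension, whereas at 10 kHz the two are nearly equal and the plane-wave approximation becomes doubtful.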

2.1.2  Electrical  Equivalent  Model  [Flanagan  et  al.,  1975] 

Sound  pressure  and  volume  velocity  for  plane  wave 
propagation  in  a uniform  tube  satisfy  the  same  wave  equation 
as  do  voltage  and  current  on  a uniform  transmission  line. 
The  uniform  transmission  line  can  be  represented  by  an 
equivalent  T-network,  resulting  in  an  equivalent  electrical 
circuit  for  the  vocal  system  as  shown  in  Figure  2-3.  The 
subglottal  and  vocal  tract  systems  for  voiced  sounds 
introduce  resonances,  called  formants,  in  the  frequency 
domain.

Figure 2-3. Electrical equivalent circuit. [From Flanagan
et al., 1975]

By limiting the model to only the non-nasalized voiced
sounds, a simplified electrical equivalent circuit can be
obtained using parallel resonant circuits as loads. The
parameters  of  the  resonant  circuits  are  chosen  to  represent 
the  formants  and  to  approximate  the  impedances  of  the 
subglottal  and  the  vocal  tract  systems.  The  cascade 
arrangement  of  resonators  is  shown  to  yield  vowel  sounds 
having  formants  of  proper  amplitude  when  information 
specifying  the  formant  frequencies  is  known. 

2.1.3  Linear  Source-Filter  Model  [Fant,  1960] 


The  subglottal  system  can  be  thought  of  as  constant, 
because  the  pressure  drop  across  the  bronchi  and  trachea  is 
small  and  thus  the  subglottal  pressure  is  maintained 
relatively  constant  over  the  duration  of  several  pitch 
periods  by  the  low-impedance  lung  reservoir.  Also  the 
glottal  impedance  is  generally  very  large  when  compared  to 
the  vocal  tract  input  impedance,  because  of  the  relatively 
small  opening  of  the  glottis  which  separates  the  subglottal 
and  supraglottal  regions.  The  supraglottal  vocal  tract  acts 
as  a filter  that  shapes  the  spectrum  of  the  glottal  flow  to 
produce  different  sounds.  That  is,  the  source  does  not 
depend on the vocal tract shape. Hence the subglottal system
can be removed completely, ignoring the time-varying and
nonlinear glottal impedances. Thus the filter function
becomes  linear  and  short-time  invariant. 


Figure 2-4. General discrete-time model of speech
production. [From Rabiner and Schafer, 1978]

In the linear filter model of speech production (its
discrete-time version is shown in Figure 2-4) the speech
signal is modeled as the output of a linear
quasi-time-invariant filter excited by quasi-periodic pulses
for  voiced  sounds,  or  by  random  noise  for  unvoiced  sounds. 
The  transmission  characteristic  of  the  vocal  tract  is  well 
approximated  by  a cascade  of  uncoupled  resonators  and 
anti-resonators  whose  bandwidths  and  center  frequencies  may 
be  independently  controlled. 
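
The linear source-filter idea can be sketched in a few lines of code. The example below is a minimal illustration, not the synthesizer used in this study: it excites a cascade of second-order digital resonators with an impulse train. The resonator difference equation is the familiar one used in cascade formant synthesizers, and the formant frequencies, bandwidths, pitch, and sampling rate are assumed values chosen only for illustration.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (unity gain at dc)."""
    t = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(x, coeffs):
    """Run one second-order resonator over the sequence x."""
    a, b, c = coeffs
    y1 = y2 = 0.0
    out = []
    for sample in x:
        y0 = a * sample + b * y1 + c * y2
        out.append(y0)
        y2, y1 = y1, y0
    return out


def impulse_train(n_samples, f0_hz, fs_hz):
    """Quasi-periodic excitation reduced to its simplest form: unit impulses."""
    period = int(round(fs_hz / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def synthesize_vowel(formants, n_samples=1000, f0_hz=100.0, fs_hz=10000.0):
    """Pass the excitation through the resonators connected in cascade."""
    signal = impulse_train(n_samples, f0_hz, fs_hz)
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, resonator_coeffs(freq_hz, bw_hz, fs_hz))
    return signal


# Assumed (frequency, bandwidth) pairs in Hz for an /a/-like vowel.
A_LIKE_FORMANTS = [(700.0, 60.0), (1200.0, 90.0), (2600.0, 150.0)]
speech = synthesize_vowel(A_LIKE_FORMANTS)
```

Because each resonator is controlled only by its center frequency and bandwidth, changing one formant leaves the others untouched, which is the independence property noted above.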

2.2  Speech  Analysis/Synthesis 

A goal  of  speech  analysis  and  synthesis  is  the 
efficient  encoding,  transmission  and  processing  of  speech 
information.  Another  important  purpose  of  speech  analysis 
and  synthesis  is  to  acquire  a basic  understanding  of  the 
speech  communication  processes.  Speech  synthesis  systems 
also  play  a fundamental  role  in  modeling  human  speech 
production.

Speech  analysis  is  the  process  of  estimating  the 
parameters  of  the  speech  production  model  from  a speech 
signal  that  is  assumed  to  be  the  output  of  that  model. 
Continuous  speech  is  usually  analyzed  by  performing  analysis 
processes  repeatedly  on  short  segments  of  the  speech  signals 
(typically 10-30 msec durations), producing time-varying
parameters  of  the  model. 
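
The short-segment processing described above can be sketched as follows. This is a minimal illustration that assumes a 20 msec frame, a 10 msec frame shift, and mean-square energy as a stand-in for the model parameters; none of these choices is taken from the analysis procedure of this study.

```python
import math


def frame_signal(signal, fs_hz, frame_ms=20.0, hop_ms=10.0):
    """Split a signal into overlapping short segments of equal length."""
    frame_len = int(round(fs_hz * frame_ms / 1000.0))
    hop_len = int(round(fs_hz * hop_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop_len
    return frames


def short_time_energy(frame):
    """Mean-square value of one frame (a simple time-varying parameter)."""
    return sum(s * s for s in frame) / len(frame)


# One second of a 100 Hz tone sampled at 10 kHz serves as a test signal.
FS_HZ = 10000
tone = [math.sin(2.0 * math.pi * 100.0 * n / FS_HZ) for n in range(FS_HZ)]
frames = frame_signal(tone, FS_HZ)
energies = [short_time_energy(f) for f in frames]
```

In an actual analysis system the energy computation would be replaced by the estimation of the model parameters for each frame.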

Speech  synthesis  is  the  process  of  producing  an 
acoustic  signal  by  controlling  and  updating  the  speech 
production  model  with  an  appropriate  set  of  parameters.  The 
model  parameters  can  be  obtained  either  by  the  analysis  of 
real  speech  signals  or  by  the  analysis-by-synthesis 
procedure.  If  the  model  is  sufficiently  accurate  and  its 
parameters  are  accurately  estimated,  the  resulting  output  of 
the  model  is  comparable  to  natural  speech. 

Most  speech  processing  systems  use  bandwidths  between 
3 kHz  (for  telephonic  applications)  and  5 kHz  (for  high 
quality and/or research purposes). The perception of some
consonants  is  slightly  impaired  if  the  frequencies  between  3 
and  5 kHz  are  omitted.  Frequencies  above  5 kHz  are  important 
for  speech  clarity  and  naturalness,  but  do  little  for  speech 
intelligibility.  Thus,  it  is  necessary  to  maintain  a balance 
between  speech  quality  and  storage  requirements,  algorithm 
complexity,  and  computation  time,  depending  on  the  specific 
purpose  of  the  speech  processing  system. 

2.3  Methodology 

In  the  speech  production  model  it  is  generally 
believed  that  the  glottal  source  characteristics  vary 
relatively  little  (unlike  the  vocal  tract)  across  sentences. 
Moreover,  we  can  create  various  quality  voices  by  changing 
the glottal source model parameters [Childers et al., 1989].
Since  different  voice  types  indicate  the  variations  in  the 
"total  auditory  impression,"  we  examined  the  "averaged" 
characteristics  of  the  source  for  each  voice  type. 
Statistical  analysis  of  the  source  characteristics  of 
various  sustained  vowels  and  voiced  sentences  was  conducted 
in  order  to  develop  a statistical  model  to  be  used  to 
characterize  each  different  voice  type.  Here  the  main 
analysis tool to obtain source characteristics is glottal
inverse filtering. The statistical model represents the
different glottal source characteristics in producing
different voice types.

From the statistical results, sets of "nominal" parameter
values were identified for a selected glottal source
model.  These  statistical  models  are  used  to  synthesize 
various  voice  types  and  to  study  variability  in  different 
voice  types.  Moreover,  we  may  use  these  nominal  sets  of 
source  parameters  for  estimating  vocal  tract  parameters  by 
the  analysis-by-synthesis  method.  Figure  2-5  briefly  shows 
the  procedure  of  this  research  methodology.  This 
methodology can be easily extended to analyze and synthesize
sentences of various voice types.

2.3.1  Analysis-by-Synthesis 

To extract the essential features that characterize
different voice types, we first analyzed speech signals to
estimate glottal waves by using inverse filtering techniques
[Krishnamurthy and Childers, 1986; Fujisaki and Ljungqvist,
1986]. EGG signals were used to get glottal timing
information [Krishnamurthy and Childers, 1986]. A formant
synthesizer [Pinto et al., 1989] was then adopted to verify
the estimated statistical models of the glottal source for
producing different voice types.

Figure 2-5. The procedure of research.

2.3.2  Glottal  Inverse  Filtering 

Different  voice  types  are  characterized  by  different 
vocal-fold  vibratory  patterns.  Vocal-fold  vibratory  patterns 
specify  characteristics  of  the  glottal  flow  - the 
volume-velocity  of  the  air  flow  at  the  glottis.  To  determine 
the  acoustic  correlates  of  various  voice  types,  several 
researchers  have  studied  both  vocal  fold  movement  and 
glottal flow during phonation. Since the vocal folds are
located below the pharynx, it is difficult to measure
vocal-fold movements and/or glottal flow directly. Thus, in
this  research,  glottal  inverse  filtering  was  used  to  study 
the  glottal  source  characteristics.  Glottal  inverse 
filtering  is  a technique  by  which  the  flow  (called  the 
glottal  volume-velocity)  past  the  time  varying  glottal 
constriction  is  estimated  by  filtering  the  acoustic  speech 
signal.  Filtering  removes  the  effects  of  the  vocal  tract 
resonances  to  reveal  the  underlying  voice-source  signal. 

Almost  all  techniques  used  for  glottal  inverse 
filtering  are  based  on  the  linear  model.  The  source  is 
assumed  to  be  a periodic  waveform  generator  that  outputs 
pulses  of  volume-velocity.  The  volume-velocity  is  the  input 
to  a linear,  time  invariant  vocal  tract  filter.  The  transfer 
function  of  the  vocal  tract  filter  is  determined  by  the 
supraglottal  articulators.  The  output  of  this  filter  is  then 
passed  through  a second  filter  that  models  the  radiation  at 
the  lips,  and  is  finally  output  as  speech.  The  vocal  tract 
filter  for  vowel  sounds  is  usually  modeled  as  an  all-pole 
filter;  this  can  be  theoretically  justified  on  the  basis  of 
acoustic  tube  modeling  of  the  vocal  tract.  The  inverse  of 
the  vocal  tract  filter  therefore  contains  only  zeros  or 
anti-resonances. If the radiated speech wave is passed
through  this  inverse  vocal  tract  filter,  the  output  will  be 
the  differentiated  glottal  volume-velocity.  A simple 
integration  of  this  signal  will  yield  the  glottal 
volume-velocity. 
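
The chain of operations described above can be sketched as a round trip on synthetic data. In this minimal illustration the vocal tract is reduced to a single all-pole resonance whose coefficients are assumed to be known exactly, lip radiation is modeled as a first difference, and the glottal pulse is a hypothetical raised-cosine shape. In practice the filter coefficients must be estimated from the speech signal, and that step is not shown here.

```python
import math


def all_pole_filter(x, a):
    """Vocal tract model: y[n] = x[n] - sum_k a[k] * y[n-k], k = 1..p."""
    y = []
    for n, xn in enumerate(x):
        acc = xn
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y


def inverse_filter(y, a):
    """All-zero inverse of the model: e[n] = y[n] + sum_k a[k] * y[n-k]."""
    e = []
    for n, yn in enumerate(y):
        acc = yn
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc += ak * y[n - k]
        e.append(acc)
    return e


def differentiate(x):
    """First difference, used here as a simple lip radiation model."""
    return [x[0]] + [x[n] - x[n - 1] for n in range(1, len(x))]


def integrate(x):
    """Running sum, the inverse of the first difference above."""
    out, acc = [], 0.0
    for xn in x:
        acc += xn
        out.append(acc)
    return out


def glottal_pulse_train(n_samples, period, open_len):
    """Hypothetical raised-cosine flow pulses separated by closed phases."""
    out = []
    for n in range(n_samples):
        m = n % period
        if m < open_len:
            out.append(0.5 * (1.0 - math.cos(2.0 * math.pi * m / open_len)))
        else:
            out.append(0.0)
    return out


def formant_to_a(freq_hz, bw_hz, fs_hz):
    """Denominator coefficients a[1], a[2] of a single resonance."""
    r = math.exp(-math.pi * bw_hz / fs_hz)
    theta = 2.0 * math.pi * freq_hz / fs_hz
    return [-2.0 * r * math.cos(theta), r * r]


# Forward model: glottal flow -> vocal tract -> lip radiation -> speech.
a = formant_to_a(500.0, 80.0, 10000.0)  # assumed formant, bandwidth, rate
flow = glottal_pulse_train(400, 100, 60)
speech = differentiate(all_pole_filter(flow, a))

# Inverse filtering gives the differentiated flow; integration gives the flow.
d_flow = inverse_filter(speech, a)
recovered = integrate(d_flow)
max_err = max(abs(u - v) for u, v in zip(flow, recovered))
```

With exact coefficients the recovered waveform matches the original flow to within rounding error. With estimated coefficients the quality of the recovered waveform depends on how well the all-pole model fits the vocal tract.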

2.3.3  Statistical  Analysis 

Once  the  acoustic  parameters  of  the  glottal  model  were 
obtained,  a statistical  analysis  was  performed  in  order  to 
extract  nominal  values  and  ranges  for  those  parameters.  The 
results  of  the  statistical  analysis  served  as  a basis  for 
synthesizing  various  voice  types.  We  shall  discuss  the  basic 
assumptions  and  procedures  of  the  statistical  analysis  in 
section  2.6. 

2.3.4  Perceptual  Evaluation 

In synthesizing various voice types, we controlled a
chosen factor of the synthesizer while holding the other
factors constant. The results of the listening tests
revealed the perceptual correlates of the glottal factors
chosen based upon the statistical model, and provided useful
information about controlling source parameters. This
process helped us capture the invariant glottal source
characteristics of speech signals for different speaking
characteristics.

2.3.5  Hypotheses 

The  hypotheses  that  this  research  is  based  upon  can  be 
summarized  as  follows: 

1) Information related to phonemes resides in the formant
structure (formant frequencies, bandwidths, and amplitudes)
as a form of resonator characteristics.

2) The formant structure can be varied to some extent
while the perceived phonemic information remains the same.

3)  For  a given  speaker,  the  glottal  source  characteristics 
apparently  vary  relatively  little,  unlike  the  vocal  tract 
characteristics,  across  sentences. 

4)  Different  voice  types  are  characterized  by  distinctive 
vocal-fold  vibratory  patterns  and  thus  by  different 
characteristics  of  the  glottal  flow. 

2.4  Experimental  Data  Base 
2.4.1  Subjects  and  Tasks 

The data base used in this research consisted of
recordings of two vowels and sentences from several speakers
with different voice types. Each speech data record was
previously categorized by professional speech scientists
into one of three voice types: breathy, vocal fry, and
modal.

Three categories of subjects served in this research:
(1) normal subjects (CKL, DRW) who had no history of vocal
disorders or laryngeal pathology, (2) patients with vocal
disorders (EDR, JMS, JTO) whose voices were evaluated by
experienced speech pathologists, and (3) experienced speech
pathologists (DMH, GPM). All subjects were male. Female
subjects were excluded to avoid a gender factor in the
research.

The  experimental  tasks  for  each  subject  were: 

(1) sustained vowels /i/ and /a/ using an Electro-Voice
RE-10  microphone,  and  a Bruel  & Kjaer  model  4133  condenser 
microphone, 

(2)  counting  from  one  to  ten  with  comfortable  pitch 
and  loudness, 

(3)  counting  from  one  to  ten  with  progressive  increase 
in  loudness, 

(4)  singing  the  chromatic  scale  using  "la", 

(5)  three  sentences  ("We  were  away  a year  ago.", 
"Early  one  morning  a man  and  a woman  ambled  along  a one  mile 
lane.",  and  "Should  we  chase  those  cowboys?"). 

Tasks  two  through  six  used  the  Electro-Voice  RE-10 
microphone.  Two  speech  pathologists  mimicked  various  voice 
types  (modal,  breathiness,  vocal  fry,  hoarseness)  for  the 
same tasks. The normal subject CKL also mimicked vocal fry
to perform task one. Table 2-1 lists the data that were
analyzed for this study, together with the subjects and the
tasks they performed.
Table 2-1. The data base for speech analysis

Subject  Sex/Age  Phonation Type              Data File  Contents  Microphone
-------  -------  --------------------------  ---------  --------  ----------
DMH      M/37     modal voice                 dmhn003c   /i/       E
                                              dmhn012c   /a/       E
                                              dmhn025c   S1        E
                                              dmhn028    /a/       B
DRW      M/23     modal voice                 drwn003c   /i/       E
                                              drwn012c   /a/       E
                                              drwn025c   S1        E
                                              drwn028    /a/       B
CKL      M/31     modal voice                 modb1      /i/       B
                                              modb2      /a/       B
                                              mode1c     /i/       E
                                              mode2c     /a/       E
DMH      M/37     mimicked breathy voice      dmhp010c   /i/       E
                                              dmhp012c   S1        E
EDR      M/22     pathological breathy voice  edrp003c   /i/       E
                                              edrp005c   S1        E
                                              edrp010    /i/       B
GPM      M/79     mimicked breathy voice      gpmp003c   /i/       E
                                              gpmp005c   S1        E
JMS      M/30     pathological breathy voice  jmsp003c   /i/       E
                                              jmsp005c   S1        E
                                              jmsp010    /i/       B
CKL      M/31     mimicked fry voice          fryb1      /i/       B
                                              fryb2      /a/       B
                                              frye1c     /i/       E
                                              frye2c     /a/       E
JTO      M/21     pathological fry voice      jtop003c   /i/       E
                                              jtop005c   S1        E
                                              jtop010    /i/       B

*Contents: S1 = "We were away a year ago."
*Microphone Type: E = Electro-Voice RE-10, B = B&K 4133

2.4.2  Data  Collection 

All  data  recordings  were  done  inside  an  Industrial 
Acoustics  Company  (IAC)  single-wall  sound  room.  The  speech 
signals  collected  have  a high  signal-to-noise  ratio  (SNR) 
and  noise  can  be  effectively  ignored.  The  speech  and 
electroglottographic  (EGG)  signals  were  collected 
simultaneously. A microphone (an Electro-Voice RE-10 dynamic
cardioid microphone or a Bruel & Kjaer (B&K) model 4133
condenser microphone, depending on the task recorded) was
located at a fixed distance of 6 inches from the speaker's
lips. The electroglottograph used was a Synchrovoice Inc.
model.

Before  digitization,  the  speech  and  EGG  signals  were 
bandlimited  to  5 kHz  by  anti-aliasing,  passive,  elliptic 
filters with a minimum stopband attenuation of 55 dB and a
passband  ripple  of  ±0.2  dB.  Both  signals  were  then  amplified 
by  a Digital  Sound  Corporation  DSC-240  audio  control 
console.  The  synchronized  speech  and  EGG  signals  were 
directly  digitized  at  a sampling  frequency  of  10  kHz  per 
channel  by  a Digital  Sound  Corporation  DSC-200  A/D  and  D/A 
system  with  16-bit  precision. 
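
The digitization settings above imply a few simple figures, computed in the sketch below. The sampling rate, word length, and filter cutoff are taken from the text. The dynamic range is the theoretical value for an ideal 16-bit quantizer and is not a measured property of the recording system.

```python
import math

SAMPLING_RATE_HZ = 10_000      # per channel, from the text
BITS_PER_SAMPLE = 16           # from the text
ANTI_ALIAS_CUTOFF_HZ = 5_000   # from the text

# Highest frequency representable without aliasing at this sampling rate.
nyquist_hz = SAMPLING_RATE_HZ / 2

# Number of quantization levels and the ideal dynamic range they span.
levels = 2 ** BITS_PER_SAMPLE
dynamic_range_db = 20.0 * math.log10(levels)

# Storage needed for one second of one channel.
bytes_per_second_per_channel = SAMPLING_RATE_HZ * BITS_PER_SAMPLE // 8

print(f"Nyquist frequency: {nyquist_hz:.0f} Hz")
print(f"Cutoff within Nyquist limit: {ANTI_ALIAS_CUTOFF_HZ <= nyquist_hz}")
print(f"Ideal dynamic range: {dynamic_range_db:.1f} dB")
print(f"Storage: {bytes_per_second_per_channel} bytes/s per channel")
```

The filter cutoff coincides with the Nyquist frequency, so the stopband attenuation of the elliptic filters is what limits aliasing near 5 kHz.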

2.4.3  Microphone  Characteristics 

When  the  speech  signal  is  used  for  glottal  source 
estimation,  the  recording  device  should  have  a good 
low-frequency  response.  The  reason  for  this  is  that  the 
glottal source waveform, which is to be estimated, has its
major energy components at low frequencies (dc to 1 kHz). Of
the two microphones used to measure the sound pressure
waveforms, the B&K 4133 condenser microphone has the better
low-frequency response. Its amplitude response is within ±1
dB  down  to  20  Hz,  and  its  phase  response  is  linear.  The  -3 
dB  low-frequency  cut-off  is  approximately  10  Hz.  Because  of 
this  good  low-frequency  characteristic,  the  B&K  4133 
condenser  microphone  is  also  sensitive  to  low-frequency 
breath  and  ambient  noise,  which  may  cause  problems  in  speech 
analysis.  Therefore,  the  Electro-Voice  RE-10  microphone  was 
used  to  collect  most  of  the  speech  data. 


The Electro-Voice RE-10 microphone has a good frequency
response at frequencies above 50 Hz, but attenuates the
low-frequency components below 50 Hz. When
compared  to  the  B&K  4133  condenser  microphone,  the  obvious 
drawback  of  the  Electro-Voice  RE-10  microphone  is  the  lack 
of  good  low-frequency  response.  Thus,  speech  data  collected 
by  using  the  Electro-Voice  RE-10  microphone  had  to  be 
corrected  to  compensate  for  the  low-frequency  distortion, 
based  upon  the  characteristics  of  the  B&K  4133  condenser 
microphone  [Wong,  1991].  In  this  research  we  used  only 
corrected  speech  data  (if  recorded  through  the  Electro-Voice 
RE-10  microphone)  as  well  as  data  obtained  by  using  the  B&K 
4133  condenser  microphone. 
2.5  Electroglottograph  (EGG) 

The  ElectroGlottoGraph  (EGG)  is  an  instrument  designed 
to  register  the  vocal  fold  vibration  as  a time-varying 
signal.  The  EGG  measures  the  radio-frequency  (RF)  impedance 
across  the  larynx,  and  hence  the  amplitude  variations  of  the 
EGG  signal  are  generally  thought  to  be  representative  of  the 
area  of  contact  of  the  vocal  folds.  An  objective  of  this 
device  is  to  provide  a measure  of  vocal  fold  activity 
de-coupled from the effects of the supra-glottal system. A
comprehensive review of EGG instrumentation, waveform
interpretation, and applications was given by
Krishnamurthy and Childers (1986).

A pair  of  electrodes  is  applied  to  the  neck  at  the 
level  of  the  larynx.  A high  frequency  (about  5 MHz)  current 
passes from one electrode through the neck and is picked up
by the other electrode. As the subject phonates, the
opening and closing of the vocal folds changes the
electrical  impedance  of  the  neck  in  the  region  of  the 
electrodes.  This  modulates  the  RF  current,  which  is  then 
demodulated  using  a detector  to  yield  the  EGG  signal. 

The  EGG  indicates  the  electrical  impedance  through  the 
neck  at  the  level  of  the  larynx  and  thus  monitors  variations 
in  vocal  fold  contact  - glottal  closure  is  associated  with  a 
reduction  in  tissue  impedance.  A time  lag  for  the  acoustic 
propagation  delay  from  the  glottis  to  the  microphone  is 
applied  to  the  EGG  signal  when  it  is  compared  with  the 
speech  signal.  The  magnitude  of  the  EGG  signal  is  inversely 
proportional  to  vocal  fold  contact  area,  so  that  an  increase 
in  amplitude  denotes  glottal  opening.  The  steep  negative 
slope  of  the  EGG  signal  associated  with  glottal  closure 
occurs  over  only  one  or  two  sample  points.  Glottal  opening 
occurs  more  slowly  and  makes  the  true  opening  point  more 
difficult  to  determine  accurately. 
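
The timing information described above can be sketched as follows. This is a minimal illustration and not the detection procedure used in this study: closure instants are taken as the steepest negative-going samples of the differenced EGG signal, and the propagation delay is computed from an assumed path length. The synthetic waveform, the threshold, the speed of sound, and the 17 cm vocal tract length are all assumptions.

```python
SPEED_OF_SOUND_CM_PER_S = 35000.0  # assumed


def propagation_delay_samples(path_cm, fs_hz):
    """Acoustic delay from glottis to microphone, rounded to whole samples."""
    return int(round(path_cm / SPEED_OF_SOUND_CM_PER_S * fs_hz))


def closure_instants(egg, threshold):
    """Sample indices at which the EGG falls most steeply (glottal closure)."""
    d = [egg[n] - egg[n - 1] for n in range(1, len(egg))]
    instants = []
    for i in range(1, len(d) - 1):
        # d[i] is the step into sample i + 1; keep steep local minima only.
        if d[i] < -threshold and d[i] <= d[i - 1] and d[i] < d[i + 1]:
            instants.append(i + 1)
    return instants


def synthetic_egg(n_samples, period=100, closure_at=20):
    """Hypothetical EGG: high when open, steep fall at closure, slow rise."""
    out = []
    for n in range(n_samples):
        m = n % period
        if m < closure_at:
            out.append(1.0)
        elif m < closure_at + 2:
            out.append(1.0 - 0.5 * (m - closure_at + 1))
        else:
            out.append((m - closure_at - 2) / (period - closure_at - 2))
    return out


FS_HZ = 10000
egg = synthetic_egg(400)
instants = closure_instants(egg, threshold=0.25)

# Period-by-period pitch estimates from successive closure instants.
periods = [b - a for a, b in zip(instants, instants[1:])]
f0_hz = [FS_HZ / p for p in periods]

# Assumed path: 17 cm vocal tract plus the 6 inch (15.24 cm) lip distance.
delay = propagation_delay_samples(17.0 + 15.24, FS_HZ)
```

The polarity follows the convention stated above, in which an increase in amplitude denotes glottal opening. Under the assumed path length the EGG signal would be shifted by about nine samples at 10 kHz before comparison with the speech signal.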

The  EGG  does  not  reflect  a direct  measure  of  glottal 
area.  It  is  postulated  that  the  tissue  impedance  is 
inversely  proportional  to  the  lateral  contact  area  of  the 
vocal  folds.  The  EGG  channel  can  (1)  help  solve  the 
deconvolution  problem  of  inverse  filtering  the  speech 
signal,  (2)  improve  voiced,  unvoiced,  and  silence  detection 
and  fundamental  frequency  estimation,  and  (3)  facilitate 
spectral  estimation  and  formant  tracking  [Krishnamurthy  and 
Childers, 1986]. The EGG-based pitch detection scheme
provides  the  pitch  on  a period-by-period  basis.  The 
two-channel  (speech  and  EGG)  speech  analysis  technique 
provides  computational  and  performance  improvements  over 
"speech  only"  analysis  methods  because  of  the  added  EGG 
channel.


2.6  Comparative  Study  of  Acoustic  Parameters 

In  this  section  some  basic  ideas,  assumptions,  and 
procedures  of  the  statistical  analysis  of  the  data  are 
discussed. 
2.6.1  One-way  ANOVA  Statistical  Testing 

The analysis of variance (ANOVA) is used to determine
whether the sample means differ sufficiently to suggest that
the underlying population means differ, given the random
sampling error. To determine this, the ANOVA computes
and compares two basic sources of variation: (1)
between-group variation and (2) within-group variation.
Between-group variability measures the variability of the
group means about the grand mean, and within-group
variability measures how much samples vary within a group.
In the context of ANOVA, a variance is often called a mean
square (MS) and is defined as


$$ \mathrm{MS} = \frac{\mathrm{SS}}{\mathrm{df}} = \frac{\text{variation (sum of squares)}}{\text{degrees of freedom}} \tag{2-1} $$

i.e.,  a mean  square  (variance)  is  the  average  variation  per 
degree  of  freedom.  The  rationale  of  the  ANOVA  is  that  the 
total  sum  of  squares  or  variances  of  a set  of  measurements 
composed  of  several  groups  can  be  divided  into  specific 
parts  that  are  identifiable  with  a given  source  of 
variation. 

Under  the  assumption  that  the  groups  that  comprise  a 
series  of  measures  are  random  samples  from  a common  normal 
distribution,  two  estimates  of  the  population  variance  may 
be  expected  to  differ  only  within  the  limits  of  random 
sampling.  This  null  hypothesis  is  tested  by  dividing  the 
between-group  mean  square  by  the  within-group  mean  square  to 
obtain  the  variance  ratio,  called  the  F ratio.  Our  primary 
interest  is  in  the  possibility  that  the  sample  means  differ 
or  vary  to  a degree  that  cannot  be  reasonably  attributed  to 
random  variation  when  the  null  hypothesis  is  true.  Thus, 
only  if  the  between-group  mean  square  is  larger  than  the 
within-group  mean  square,  will  the  outcome  of  the  experiment 
offer  evidence  against  the  hypothesis  that  the  average 
long-run  values  of  the  means  are  all  equal.  Only  values  of  F 
> 1 will  provide  evidence  against  the  null  hypothesis  of 
interest. If the value of F equals or exceeds a certain
value,  then  the  null  hypothesis  that  the  samples  have  been 
drawn  from  the  same  common  normal  population  may  be 
considered  invalid.  Therefore,  the  populations  from  which 
the  samples  have  been  drawn  may  differ  in  terms  of  either 
means  or  variances  or  both.  If  the  variances  are 
approximately  the  same,  it  is  the  means  that  differ 
[Edwards, 1973; Stevens, 1990].

In  summary,  the  ANOVA  is  appropriate  for  comparing  k 
independent  groups  with  a single  dependent  variable.  The 
ANOVA  is  the  generalization  of  the  t test  that  is  used  to 
compare  two  groups.  In  testing  the  null  hypothesis  of  equal 
population  means,  the  ANOVA  computes  and  compares  two  basic 
sources of variation (between and within).
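
The computation of the F ratio can be sketched as follows. The three groups of numbers are made-up values chosen so that the arithmetic is easy to check by hand; they are not measurements from this study.

```python
def one_way_anova(groups):
    """Return (F ratio, between-group df, within-group df)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group variation: group means about the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-group variation: samples about their own group mean.
    ss_within = sum(sum((x - m) ** 2 for x in g)
                    for g, m in zip(groups, means))

    df_between = k - 1
    df_within = n_total - k

    # A mean square is a sum of squares divided by its degrees of freedom.
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within, df_between, df_within


# Made-up data: three groups of three observations each.
example_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_ratio, df_b, df_w = one_way_anova(example_groups)
print(f"F({df_b}, {df_w}) = {f_ratio:.2f}")
```

For these data the between-group sum of squares is 26 with 2 degrees of freedom and the within-group sum of squares is 6 with 6 degrees of freedom, so the two mean squares are 13 and 1. The resulting F ratio would then be compared with the critical value of the F distribution for (2, 6) degrees of freedom at the chosen significance level.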
2.6.2  Statistical  Discriminant  Analysis  [SAS,  1988] 

The  purpose  of  statistical  discriminant  analysis  is 
either  (1)  to  find  a mathematical  rule  or  discriminant 
function  in  order  to  decide  to  which  class  an  observation 
belongs  based  on  knowledge  of  the  quantitative  variables,  or 
(2)  to  identify  a set  or  a subset  of  combinations  of  the 
quantitative  variables  that  best  reveals  the  differences 
among  the  classes.  Discriminant  analysis  computes  various 
discriminant  functions  to  classify  observations  into  two  or 
more  known  groups  on  the  basis  of  one  or  more  quantitative 
variables.  The  discriminant  function  or  a classification 
criterion  is  determined  by  a measure  of  generalized  squared 
distance.  The  discriminant  analysis  is  different  from  a 
cluster  analysis.  While  all  varieties  of  discriminant 
analysis  require  prior  knowledge  of  the  classes,  usually  in 
the  form  of  a sample  from  each  class,  the  data  in  the 
cluster  analysis  do  not  include  information  on  class 
membership.  The  purpose  of  the  cluster  analysis  is  to 
construct  a classification. 

In  computing  distance  measures  either  a parametric 
method  or  a non-parametric  method  can  be  used.  A parametric 
method  is  appropriate  for  approximately  normal  within-class 
distributions.  The  parametric  method  generates  a linear 
discriminant  function,  based  on  the  pooled  covariance 
matrix,  if  the  within-class  covariance  matrices  are  assumed 
equal.  Otherwise,  a quadratic  discriminant  function  is  used 
based  on  the  individual  within-group  covariance  matrices.  It 
also  takes  into  account  the  prior  probabilities  of  the 
groups.  When  the  distribution  within  each  group  is  not 
assumed  to  have  any  specific  distribution,  or  is  assumed  to 
have  a distribution  different  from  the  multi-variate  normal 
distribution,  non-parametric  methods  are  used  to  derive 
classification  criteria.  Non-parametric  methods  include  the 
kernel  method  and  the  nearest-neighbor  method. 

In  most  cases  it  would  be  reasonable  to  assume  that 
each  group  has  a multi-variate  normal  distribution  and  thus 
the  parametric  discriminant  analysis  method  should  be  used. 
In  the  parametric  discriminant  analysis  method  the  squared 
distance  from  measured  vector  x to  group  t is  given  by 

dj(\)  = (x  - m^'VrHx  - in,)  (2-2) 


where  x is  a p-dimensional  vector  containing  the  quantitative 
variables  of  an  observation,  t is  a subscript  to  distinguish 
the  groups,  mt  is  the  p-dimensional  vector  of  variable  means 
in  group  t,  Vt  is  St  (the  covariance  matrix  within  group  t) 
if  the  within  group  covariance  matrices  are  used,  or  S (the 
pooled  covariance  matrix)  if  the  pooled  covariance  matrix  is 
used,  and  [']  is  the  matrix  transpose  operator.  The  group 
specific  density  estimate  of  x from  group  t is  represented 


as : 
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f,(x)  = (2jr)  2 |Yj"  i exp(-  0.5  • dj(x)) 


(2-3) 


where | · | is the matrix determinant operator. Using Bayes'
theorem, the a posteriori probability of x belonging to group t
can be expressed as:

    p(t|x) = \frac{p_t f_t(x)}{\sum_u p_u f_u(x)}          (2-4)

where p_t is the a priori probability of membership in group t
and the sum runs over all groups u.
Equation (2-4) is the conditional probability of t given x.
The generalized squared distance from x to group t is
defined by

    D_t^2(x) = d_t^2(x) + g_1(t) + g_2(t)          (2-5)

where g_1(t) = \log_e |S_t| if the within group covariance matrices
are used, or g_1(t) = 0 if the pooled covariance matrix is
used, g_2(t) = -2 \log_e(p_t) if the a priori probabilities are not all
equal, or g_2(t) = 0 if the a priori probabilities are all equal.
From equations (2-4) and (2-5), the a posteriori probability of
x belonging to group t is then given as:

    p(t|x) = \frac{\exp(-0.5 \, D_t^2(x))}{\sum_u \exp(-0.5 \, D_u^2(x))}          (2-6)

Based on equation (2-6), an observation is classified
into group j if setting t = j gives the largest value of
p(t|x) or the smallest value of D_t^2(x).
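
The classification rule of equations (2-2), (2-5), and (2-6) can be sketched in a few lines. This is only an illustrative sketch, assuming NumPy is available; the function names and the two-group toy data in the usage note are hypothetical and do not come from the study.

```python
import numpy as np

def generalized_sq_distance(x, mean, cov, prior, pooled=False, equal_priors=False):
    """D_t^2(x) of equation (2-5) for a single group t."""
    diff = np.asarray(x, float) - np.asarray(mean, float)
    d2 = float(diff @ np.linalg.solve(cov, diff))              # equation (2-2)
    g1 = 0.0 if pooled else float(np.linalg.slogdet(cov)[1])   # log_e |S_t|
    g2 = 0.0 if equal_priors else -2.0 * float(np.log(prior))  # -2 log_e p_t
    return d2 + g1 + g2

def classify(x, means, covs, priors):
    """Return (index of the chosen group, posterior probabilities)."""
    D2 = np.array([generalized_sq_distance(x, m, V, p)
                   for m, V, p in zip(means, covs, priors)])
    w = np.exp(-0.5 * (D2 - D2.min()))   # shifting by the minimum avoids underflow
    return int(np.argmin(D2)), w / w.sum()   # equation (2-6)
```

For example, with two hypothetical groups centered at (0, 0) and (3, 3), identity covariance matrices, and equal priors, `classify([0.5, 0.2], ...)` selects the first group, and the group with the smallest generalized distance is also the one with the largest posterior.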

2.7  Major  Acoustic  Factors  for  Speech  Variability 

Based on the speech production model, we divided the
acoustic factors/parameters related to speech variability
into three categories (Table 2-2): (1) glottal
source factors, (2) vocal tract factors, and (3)
supra-segmental factors. The major glottal source factors are
fundamental frequency F0 (equivalently, the pitch period),
glottal pulse timing, glottal pulse shape, and turbulent
noise. The formant
structure is the vocal tract factor. Supra-segmental factors
include pitch and intonation contours, energy/gain contour,
and temporal and dynamic variations in the speech spectrum.
All these factors typically have effects regardless of the
analysis method(s) used. As can be seen in Table 2-2, there
are many acoustic parameters related to source
characteristics, whereas the vocal tract characteristics
mainly affect the formant structure.

The shape and periodicity of the glottal source
waveform can vary considerably. Analysis problems due to
variations in the source characteristics, such as "glottal
effort," are inevitable regardless of the method used in
the speech analysis algorithms proposed in the past [Bladon
et al., 1984; Blomberg et al., 1983; Chen, 1988; Holmes,
1986; Nocerino et al., 1985]. Glottal air flow, as well as
vocal fold vibratory patterns, may be affected by changes in
subglottal pressure and supraglottal articulations (the
latter phenomenon is called source-tract interaction). It
is not inconceivable that the glottal flow pattern may
influence the mechanical motion of the folds and thus of the
glottal area function. One would expect a small shortening
of the closure time as a response to a negative supraglottal
pressure peak preceding the termination of the glottal pulse
[Rothenberg, 1981; Fant and Ananthapadmanabha, 1982].

Table 2-2. Major acoustic parameters (measurements) related
to the speech variability.

CATEGORY          ACOUSTIC PARAMETERS
----------------  ---------------------------------------
SOURCE            PITCH PERIOD (FUNDAMENTAL FREQUENCY)
                  GLOTTAL WAVE TIMING FEATURES
                    - OPENING INSTANT
                    - CLOSING INSTANT
                    - INSTANT OF PEAK OCCURRENCE
                  GLOTTAL WAVE SHAPE FEATURES
                    - PULSE WIDTH (OPEN QUOTIENT)
                    - PULSE SKEWNESS (SPEED QUOTIENT)
                    - ABRUPTNESS OF CLOSURE
                    - SLOPE OF OPENING
                  TURBULENT NOISE CHARACTERISTICS
                  PITCH PERTURBATION
                  AMPLITUDE PERTURBATION
TRACT             FORMANT STRUCTURE
                    - FREQUENCIES
                    - BANDWIDTHS
                    - AMPLITUDES
SUPRA-SEGMENTAL   PITCH CONTOUR / INTONATION PATTERN
                  ENERGY OR GAIN CONTOUR
                  TEMPORAL SPECTRAL FEATURE
                    - SPECTRAL TILT
                  DYNAMIC SPECTRAL FEATURE
                    - SPECTRAL CONTINUITY*

* There is no wide acceptance on the parametric
definition of spectral continuity. Here spectral
continuity is defined by the spectral transitive
function (Appendix B).

F0 and spectral tilt (the spectral envelope shape in the
high-frequency region of the speech spectrum) variations
have been considered major disturbing factors in speech
recognition [Hermansky, 1987], while spectral peak
variations and dynamic variations of the speech spectrum
have been utilized to extract speaker-independent
information. F0 and spectral peak variations can be
generally reduced by the linear prediction (LP) analysis,
because the LP analysis, in principle, gives LP coefficients
relatively independent of excessive F0 variations and
emphasizes peak trends in the speech spectrum. Thus, most
of the LP-based speech recognition algorithms ignore the
influence of F0. But the conventional LPC spectrum
inevitably includes components that are related to vocal
fold duty cycles and to the talker's glottal volume velocity
waveshape. When the pitch period is short, e.g., for
females and children, the source characteristics greatly
affect the first formant structure.

Spectral tilt is influenced by the individual
speaker's glottal characteristics or by the lip radiation
characteristics. In recorded speech signals, spectral tilt
can also be affected by recording conditions. Simple slope
compensation can be done indirectly using a cepstral
distortion measure by deleting the first cepstral distance
[Hanson and Wakita, 1987] or by applying first order
all-pole filtering to the auditory transformed
autocorrelation [Hermansky and Junqua, 1988]. Weighted
cepstral distortion measures generally suppress the spectral
tilt by smoothing the leading edge of the weighting function
[Juang et al., 1987]. In the conventional LPC analysis, it
is assumed that the glottal shaping model in the speech
production model remains constant. Thus, the LPC spectrum
should not, in principle, contain any variations related to
source components. However, this is not true for real
speech signals. The most notable effect of the glottal
source difference is the spectral tilt variation between
speech spectra.

Differences in effective vocal tract length
between speakers are a major problem if the
source-tract interaction is ignored [Childers et al., 1989;
Levinson, 1987; Wakita, 1977]. Vocal tract length variations
are the most difficult problem in speaker normalization.
Variations of effective vocal tract length are inevitable
and most notable among speakers due to organic differences.
Thus formant structures show large variations across
speakers, and vary within a speaker due to physiological
conditions as well as psychological states. Although many
researchers have tried to normalize vocal tract length
variations from various aspects (e.g., formant frequency
normalization, auditory transformation, and frequency
warping), the results have not been satisfactory due
to the nonlinearity of the speech production mechanism. The
voice conversion technique proposed and tested for several
years [Childers et al., 1987b, 1989] is believed to be very
useful in normalizing vocal tract length variability,
especially between male and female voices. While this
technique is based on perceptual evaluation, a quantitative
measure still needs to be incorporated into the voice
conversion.

Supra-segmental problems arise from different speaking
styles and rates. Hence the pitch contour and the
energy/gain contour vary, and so do the spectral
parameters. These problems can also arise because speech
parameters are analyzed on a segment by segment basis. The
supra-segmental information, such as the dynamic spectral
changes, may be used as speaker-independent information if
an algorithm is devised to use it. However, speaking rate is
one of the most difficult parameters to treat, and variations
in speaking rate affect nearly all acoustic parameters.

Abrupt changes in the speech analysis parameters lead
to spectral discontinuities, while energy (or gain) level
variations are perceived as acoustic volume control
distortions, i.e., abrupt changes in speech intensity.
Speech scientists consider the speech spectral transition
pattern as a vital part of speech signals. Dynamic
spectral continuity might be a factor relating a
quantitative measure to the subjective evaluation of speech
quality [Juang, 1984b; Furui, 1986].

In auditory perception of phonemes, the formant
frequencies and their transitions are most important.
Moreover, from the results of speech synthesis and voice
conversion experiments, it can be said that the
intelligibility of speech signals appears to depend largely
on the tract characteristics, while the source
characteristics determine the quality of voice [Childers et
al., 1987a, 1987b, 1989]. The tract characteristics generally
represent intelligible phonemic information. In the speech
production model it is assumed that the tract can be modeled
as a resonator, while all other characteristics, except the
formant structures, are expressed through the glottal source
characteristics. Some acoustic characteristics of glottal
factors of various voice types are summarized in Table 2-3
[Lee and Childers, 1989; Childers and Lee, 1991; Lalwani and
Childers, 1991]. As can be seen from Table 2-3, different
voice types reveal a wide range of variations in acoustic
characteristics.

Table 2-3. Summary of acoustic characteristics of
glottal sources of various voice types.

                         Voice type
Parameter                MODAL       FRY     BREATHY  ROUGH    HOARSE
-----------------------  ----------  ------  -------  -------  --------------
Fundamental Frequency    medium      low     medium   medium   medium
Pulse Width              medium      short   long     medium   medium
Pulse Skewness           medium      high    low      medium   medium
Abruptness of Closure    medium      fast    slow     medium   medium
Turbulent Noise          medium      low     high     medium   high
Pitch Perturbation       low         high    high     high     high
Amplitude Perturbation   low         low     high     high     high
Vocal Intensity          wide range  low     low      medium   medium to high
Spectral Slope           medium      flat    steep    medium   flat

CHAPTER  3 

VOICE  ANALYSIS  FOR  SOURCE  FEATURE  EXTRACTION 

Our technique of glottal inverse filtering, which is
used to estimate the glottal flow waveform from the speech
signal, is described below. Some examples of glottal inverse
filtering, as well as characteristics of the glottal flow
waveforms, are also discussed.

3.1  Glottal  Wave  Estimation

In order to study the acoustic characteristics of the
glottal source we need to estimate or extract the glottal
flow waveform from the speech signal. The process of
estimating the glottal volume velocity by removing the vocal
tract resonances from the speech signal is known as glottal
inverse filtering. The objective of glottal inverse
filtering is to determine the shape of the vocal fold
waveform. Inverse filtering gives the true glottal flow
provided that the filter is an exact inverse of the transfer
function from the glottal flow to the speech wave. This
implies that the frequencies and the bandwidths of the
inverse filter should be set to represent glottal closed
conditions. An accurate estimation of either the glottal
volume velocity or the vocal-tract filter allows a
determination of the other quantity, to within the limits of
the assumed model.

3.1.1  Glottal  Inverse  Filtering 

According to the linear source-filter theory of speech
production [Fant, 1960], speech can be regarded as the
result of the convolution of the glottal source waveform and
the vocal tract transfer function. Figure 3-1 shows a block
diagram representation of a linear speech production model.
The glottal waveform is denoted by u_g(n) and the output
speech pressure waveform by s(n). For voiced speech the
driving function, u(n), to the glottal shaping model, G(z),
is a train of scaled unit samples. For unvoiced speech,
gain-adjusted white Gaussian noise is fed to the vocal
tract filter directly, i.e., G(z) = 1. The vocal tract
model, V(z), is assumed to be an all-pole model [Atal and
Hanauer, 1971]. Thus, the model is only an approximation for
nasal sounds, which contain both poles and zeros. The speech
pressure wave is related to the oral volume velocity at the
lips through a radiation impedance R(z) that can be well
represented by a simple zero (a high-pass filter) [Flanagan,
1972]. Moreover, the lip radiation impedance effectively
remains the same for different speech sounds.

For voiced speech it is possible to remove the effect
of the glottal shaping filter by a glottal inverse filtering
technique, if the instants of glottal opening and closure can
be located accurately. From Figure 3-1, voiced speech, S(z),
is represented as:

    S(z) = U(z) G(z) V(z) R(z) = U_g(z) V(z) R(z)          (3-1)

where

    U_g(z) = U(z) G(z)          (3-2)

[Figure 3-1. Block diagram representation of the linear
speech production model: driving function -> glottal shaping
filter -> glottal volume velocity -> vocal tract transfer
function -> lip radiation -> speech pressure wave s(n), S(z).]


Thus glottal inverse filtering can be conceptually defined
as solving for the glottal volume velocity U_g(z), as can be seen
in Figure 3-2 and Figure 3-3, by the equation

    U_g(z) = \frac{S(z)}{V(z) R(z)}          (3-3)

Figure 3-2(a) shows the relationship between the glottal
volume velocity, u_g(n), and the speech pressure wave, s(n).

[Figure 3-2. (a) Block diagram representation of the
conceptualized glottal inverse filtering model. (b) Model
obtained from (a) by interchanging the inverse vocal tract
and the inverse lip radiation models.]

[Figure 3-3. Equivalent representation of the linear speech
production model shown in Figure 3-1 in terms of an
effective driving function and the vocal tract transfer
function. The symbol * denotes the convolution operation.]

Since the lip radiation impedance is assumed to be the
same for different speech sounds, the basic problem in the
estimation of the glottal volume velocity waveform, u_g(n),
is to determine the parameters of the inverse filter,
1/V(z). Since the speech production model is linear, the lip
radiation and vocal tract filters can be interchanged,
leading to the arrangement of Figure 3-2(b). By combining
the lip radiation with the glottal excitation, an effective
driving function, q(n), can be defined in the form

    q(n) = u_g(n) * r(n)          (3-4)

or

    Q(z) = U_g(z) R(z)          (3-5)

where the symbol * denotes the convolution operation. The
effective driving function q(n) is equivalently the
differentiated glottal volume velocity, since R(z) can be
modeled by a simple zero.
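
As a worked step behind the last statement, suppose (an assumption made here for illustration; the text only says "a simple zero") that the zero lies exactly at z = 1 and that the gain is unity. Equation (3-5) then reduces to a first difference, and the glottal volume velocity follows from q(n) by a running sum:

```latex
R(z) = 1 - z^{-1}
\;\Longrightarrow\;
Q(z) = \left(1 - z^{-1}\right) U_g(z)
\;\Longleftrightarrow\;
q(n) = u_g(n) - u_g(n-1),
\qquad
u_g(n) = \sum_{k \le n} q(k).
```

With a zero slightly inside the unit circle the same reasoning applies, with the running sum replaced by a leaky integrator.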

Thus the linear speech production model depicted in
Figure 3-1 can be equivalently described by the model in
Figure 3-3. In terms of the effective driving function,
q(n), the speech production model can be represented as

    s(n) = -\sum_{i=1}^{M} a_i \, s(n-i) + q(n)          (3-6)

where a_i, i = 1, ..., M, are the coefficients of the vocal
tract filter, V(z), modeled by an all-pole filter. During
the interval of glottal closure the glottal volume velocity
u_g(n) is zero, and so is the driving function, q(n). Hence
equation (3-6) becomes:

    s(n) = -\sum_{i=1}^{M} a_i \, s(n-i)          (3-7)

Note that one sample after the glottal closure instant
(represented as n_c), the speech waveform is strictly a
function of the vocal tract resonances specified by a_1, ...,
a_M, and the initial conditions s(n_c), ..., s(n_c-M+1). This
result holds true over the entire closed glottal interval.

The vocal tract filter parameters a_1, ..., a_M can be
estimated by the covariance method of linear prediction (LP)
analysis [Markel and Gray, 1976]. Before applying the
closed-phase covariance LP analysis, the closed interval
must be identified.
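
To make equation (3-7) concrete, the following minimal sketch fits a_1, ..., a_M over a given closed-phase interval by least squares, which is the minimization underlying the covariance method. It assumes NumPy; the function name and the choice of a generic least-squares solver (rather than an explicit solution of the normal equations) are illustrative choices, not the implementation used in this study.

```python
import numpy as np

def covariance_lp(s, n0, n1, order):
    """Fit s(n) = -sum_i a_i s(n-i) for n0 <= n < n1 (equation 3-7).

    Samples s(n0-order) ... s(n0-1) serve as history, so n0 >= order.
    Returns (a, total squared prediction error).
    """
    s = np.asarray(s, float)
    if n0 < order or n1 > len(s) or n1 - n0 < order:
        raise ValueError("interval too short or too close to the signal start")
    # Each row holds the past samples s(n-1), ..., s(n-M) for one n.
    rows = np.array([[s[n - i] for i in range(1, order + 1)]
                     for n in range(n0, n1)])
    target = -s[n0:n1]
    a = np.linalg.lstsq(rows, target, rcond=None)[0]
    err = float(np.sum((rows @ a - target) ** 2))
    return a, err
```

On a signal that is an unforced decay of an all-pole resonator, which is what equation (3-7) predicts for the closed phase, the fit recovers the generating coefficients with essentially zero error.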

There are some restrictions in performing glottal
inverse filtering based on the linear speech production
model. Due to the lip radiation impedance (which introduces
a zero at zero frequency), the baseline of the glottal
volume velocity cannot be recovered theoretically. Low
frequency noise can also cause slight variations of the
baseline in the measured laryngeal waves. If a closed glottal
phase does not actually exist, a unique glottal
volume velocity cannot be determined from the speech
waveform. If zeros are present in the vocal tract system,
as during nasalized speech, the glottal volume velocity
cannot be determined uniquely because the glottal source
zeros cannot be separated unambiguously from the vocal tract
zeros. Thus in performing glottal inverse
filtering, (1) a closed glottal phase is assumed, (2) an
all-pole model of the vocal tract is used, and (3) nasalized
speech is usually avoided.

3.1.2  Automatic  Glottal  Inverse  Filtering 

The electroglottographic (EGG) signal makes it easier
to locate the glottal closed phase than is possible with the
speech signal only [Krishnamurthy and Childers, 1986;
Childers et al., 1990]. Thus with the aid of the EGG signal,
fully automatic glottal inverse filtering can be done. We
used the two-channel analysis technique in which a
synchronized, differentiated EGG (DEGG) signal is used to
locate the closed glottal phase. Figure 3-4 shows an example
of EGG and DEGG waveforms with the synchronized speech
signal. As can be seen in Figure 3-4, it is not difficult to
obtain the glottal opening and closing instants from the EGG
signal. While the glottal opening occurs relatively slowly,
glottal closure is associated with a rapid reduction in
tissue impedance and thus shows a large negative excursion
in the EGG signal.

The  inverse  filter  is  determined  by  a linear 
prediction  covariance  analysis  of  the  closed  phase  region  of 
the  speech  waveform.  The  closed  phase  region  is  first 
identified  based  on  the  EGG  signal.  Then  pitch-synchronous, 
closed-phase,  covariance  linear  prediction  analysis  is 
performed  on  the  closed  phase  region.  The  inverse  filter  is 
obtained  from  the  LP  coefficients  by  selecting  only  the 
valid  poles  and  zeros. 

Since vocal tract resonant poles appear only as
complex-conjugate pairs, any real roots of the polynomial
should be removed. A real pole at zero frequency will
typically occur due to low-frequency noise or a non-zero
mean in the short analysis window, which can be avoided by
high-pass filtering the speech data. However, real poles may
also occur when the filter order is overspecified. A real
pole may also occur at half the sampling frequency. When
this pole has a narrow bandwidth, it generally indicates a
formant location nearby, and thus should be retained. If a
real pole occurs due to spectral shaping requirements in the
analysis and there is not a nearby resonance, it will
generally be of wide bandwidth. Including such a pole in the
inverse filter will have a minimal effect on the results.
Therefore, as a practical matter, any poles at half the
sampling frequency are not removed. The resulting vocal
tract filter is then used to obtain the differentiated
glottal flow by inverse filtering the speech signal.

[Figure 3-4. (a) Speech signal, (b) EGG signal, and (c)
differentiated EGG (DEGG) signal. T0: pitch period, OP: open
phase, CP: closed phase.]

3.1.3  Implementation 

To estimate the glottal flow, an analysis algorithm
was implemented based on both the speech and EGG signals.
The overall analysis algorithm for estimating the glottal
flow is summarized in Figure 3-5. A frame of the speech
signal is first identified as voiced or unvoiced, which is
determined from the differentiated EGG (DEGG) signal. For
a voiced frame, the bounds of the closed phase region are
determined from the DEGG signal. Multiple fixed-frame linear
prediction covariance analyses are then performed within the
bounds of the closed phase region. A set of LP coefficients
is chosen based on the minimum prediction error criterion.
This yields the inverse filter that gives the minimum
variance glottal volume-velocity waveform over the closed
phase region.

[Figure 3-5. The block diagram for the analysis algorithm.]

For a voiced frame, the pitch period, the instant of
glottal opening (equal to the starting point of the frame),
the location of the glottal flow peak, and the location and
magnitude of the negative minimum of the differentiated
glottal flow are also computed. For an unvoiced frame,
fixed frame autocorrelation LP analysis is performed.
The results for each analysis frame also contain the starting
point of the frame, the frame type (voiced or unvoiced), the
frame length in number of samples, and the power of the
speech signal within the frame. The results for each analysis
frame are stored in a file as a feature record for later use.

Normally, speech signals are not pre-emphasized before
analysis. However, for some speech data, especially for
breathy voices, an adaptive pre-emphasis was applied to
compute the closed phase LP coefficients. Glottal inverse
filtering, however, was done on the original speech signal.
The adaptive pre-emphasis is based on an estimate of the
general spectral tilt of the speech signal analyzed. LP
analysis is easier on the pre-emphasized
speech signal than on the original speech signal
when the signal has a large spectral tilt in the
high-frequency region. The reason for this is that
pre-emphasis often reduces the ill-conditioning of the
computation [Markel and Gray, 1976].

The general spectral tilt of a speech signal can be
characterized by at most two real poles, which can be
obtained from successive first order LP analyses of the
speech signal [Lee and Childers, 1989]. The estimated
real poles provide a good approximation of the general
spectral tilt of speech signals. For most modal and vocal
fry phonations the general spectral tilt can be approximated
by a single real pole, while for breathy phonation two real
poles are needed to characterize the relatively greater
spectral roll-off at high frequencies. Thus these pole
values can serve as parameters in pre-emphasizing the input
speech signal.
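
One plausible reading of "successive first order LP analysis" is sketched below: estimate a real pole from the lag-1 autocorrelation ratio, remove it with a first-order pre-emphasis filter, and repeat on the result. The details (autocorrelation estimate, function names, handling of the first sample) are assumptions for illustration and are not taken from [Lee and Childers, 1989].

```python
def first_order_pole(x):
    """First order autocorrelation LP: pole estimate r(1) / r(0)."""
    r0 = sum(v * v for v in x)
    r1 = sum(x[n] * x[n - 1] for n in range(1, len(x)))
    return r1 / r0 if r0 > 0.0 else 0.0

def pre_emphasize(x, mu):
    """Apply y(n) = x(n) - mu * x(n-1); the first sample is passed through."""
    return [x[0]] + [x[n] - mu * x[n - 1] for n in range(1, len(x))]

def estimate_tilt_poles(x, n_poles=2):
    """Return (list of real pole estimates, signal after removing them)."""
    poles, y = [], list(x)
    for _ in range(n_poles):
        mu = first_order_pole(y)
        poles.append(mu)
        y = pre_emphasize(y, mu)
    return poles, y
```

For a signal whose tilt comes from a single real pole, the second estimate comes out close to zero, which is consistent with the observation that one pole suffices for most modal and vocal fry phonations.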

Voiced and unvoiced segments in the speech signal are
determined from the differentiated EGG (DEGG) signal. It is
known that voiced sounds have large negative minima in the
DEGG corresponding to the instant(s) of closure, so a
negative threshold is used to locate these minima.
Figure 3-6, Figure 3-7, and Figure 3-8 show a sentence
("Should we chase those cowboys?" spoken by a normal subject
DMH) synchronized with the EGG and DEGG waveforms.
Figure 3-7 and Figure 3-8 show the beginning and the end
of the voiced portions of the sentence. The duration
between the minima of the DEGG waveforms gives the pitch
period. Voicing is considered to start when two successive
minima less than the threshold are found and the pitch
period between them is in a range of 25 to 200 samples at a
10 kHz sampling rate (F0 between 400 Hz and 50 Hz).
When the above two conditions are not met, the corresponding
segments are considered as unvoiced.
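
The voicing rule just described can be sketched as follows. The local-minimum test and the function names are assumptions made for illustration; the threshold value itself is not given in the text and has to be supplied by the caller.

```python
def find_closure_instants(degg, threshold):
    """Indices of local minima of the DEGG that fall below a negative threshold."""
    instants = []
    for n in range(1, len(degg) - 1):
        if degg[n] < threshold and degg[n] <= degg[n - 1] and degg[n] < degg[n + 1]:
            instants.append(n)
    return instants

def voiced_periods(closures, min_period=25, max_period=200):
    """Pairs of successive closure instants whose spacing is a valid pitch period.

    The default bounds are the 25 to 200 samples quoted for a 10 kHz sampling rate.
    """
    return [(a, b) for a, b in zip(closures, closures[1:])
            if min_period <= b - a <= max_period]
```

Pairs of minima that are too close together or too far apart are rejected, so the corresponding stretch of signal is treated as unvoiced.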


[Figure 3-6. (a) Synchronized speech signal, (b) EGG signal,
and (c) differentiated EGG (DEGG) signal of a sentence.
Vertical axes: amplitude; horizontal axis: time [ms].]

[Figure 3-7. (a) Synchronized speech signal, (b) EGG
signal, and (c) differentiated EGG (DEGG) signal at the
beginning of voicing in the sentence shown in Figure 3-6.]

[Figure 3-8. (a) Synchronized speech signal, (b) EGG signal,
and (c) differentiated EGG (DEGG) signal at the end of
voicing in the sentence shown in Figure 3-6.
Vertical axes: amplitude; horizontal axis: time [ms].]

Since we use pitch synchronous analysis, the analysis
frame  size  of  a voiced  region  of  speech  signals  is 
equivalent  to  a pitch  period.  The  pitch  period,  T0,  is 
chosen  to  start  at  the  instant  of  glottal  opening,  and  to 
end  at  the  next  instant  of  glottal  opening.  This  helps  to 
fit  a source  model  to  the  differentiated  glottal  flow.  The 
instant  of  glottal  opening  is  determined  as  the  location  of 
the  maximum  between  two  minima  in  the  differentiated  EGG. 
To  accommodate  an  error  in  the  location  of  the  instant  of 
glottal  opening,  the  starting  point  of  a frame  is  actually  a 
few  points  after  the  instant  of  opening.  The  region  between 
a minimum,  which  corresponds  to  the  instant  of  closure,  and 
the  next  maximum  of  the  DEGG  is  considered  as  the  closed 
phase interval. When comparing the EGG with the speech
signal, the EGG is delayed by a time lag of 0.9 msec (9
samples at a 10 kHz sampling rate) in order to compensate
for the acoustic propagation delay from the glottis to the
microphone.

It has been assumed that a sufficient duration of
closed phase, which is determined from the DEGG, exists to
perform a covariance LPC analysis. If the current closed
phase region was too short to permit an effective
covariance LP analysis, we used the LP coefficients from the
previous analysis. This is often the case for breathy
voices. If no LP coefficients were available, we computed an
autocorrelation LP analysis over the whole frame.

The default order of the linear prediction analysis is
12 and can be adjusted through a command option. The minimum
window size of the covariance LPC analysis is 28 samples.
Multiple fixed-frame LPC analyses are performed by moving
the LPC analysis window from the beginning to the end of the
closed region one sample at a time, provided that the LPC
analysis window is confined within the bounds of the closed
phase region. A set of LP coefficients is then selected based on
the minimum total squared prediction error.
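
The sliding-window selection can be sketched as below, assuming NumPy. One assumption is made loudly here: the analysis window is taken to include the `order` history samples, so that the whole window stays inside the closed phase; the text does not say how the window is counted. The function names are hypothetical.

```python
import numpy as np

def _cov_lp(s, n0, n1, order):
    """Least-squares fit of s(n) = -sum_i a_i s(n-i) for n0 <= n < n1."""
    rows = np.array([[s[n - i] for i in range(1, order + 1)]
                     for n in range(n0, n1)])
    target = -s[n0:n1]
    a = np.linalg.lstsq(rows, target, rcond=None)[0]
    return a, float(np.sum((rows @ a - target) ** 2))

def best_closed_phase_lp(s, cp_start, cp_end, order=12, window=28):
    """Slide a fixed window over [cp_start, cp_end) one sample at a time.

    Returns (window start, coefficients, error) for the window with the
    smallest total squared prediction error, or None if no window fits.
    """
    s = np.asarray(s, float)
    best = None
    for w0 in range(cp_start, cp_end - window + 1):
        # The first `order` samples of the window are history only (assumption).
        a, err = _cov_lp(s, w0 + order, w0 + window, order)
        if best is None or err < best[2]:
            best = (w0, a, err)
    return best
```

A return value of `None` corresponds to the case described above in which the closed phase is too short and the coefficients of the previous analysis would be reused.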

To find the inverse filter, the formant frequencies
and bandwidths of the poles are calculated by factoring the
polynomial formed from the chosen linear prediction
coefficients. Real poles at dc
are removed, because the vocal tract is assumed to have only
resonators. Real poles at half the sampling frequency, however,
are retained. Extraneous formants, such as those with very
low frequencies or very large bandwidths, are removed. To
get a stable inverse filter, poles outside the unit circle
are reflected inside the unit circle, even though this may
distort the glottal flow waveforms. Finally, the inverse
filter is reconstructed from the remaining poles.
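
These pole selection rules can be sketched as follows, assuming NumPy. The frequency and bandwidth limits (`min_freq`, `max_bw`) are hypothetical placeholders, since the text gives no numerical thresholds, and the formant frequency and bandwidth formulas are the usual conversions from a pole's angle and radius rather than anything specific to this study.

```python
import numpy as np

def prune_poles(a, fs, min_freq=50.0, max_bw=1000.0, tol=1e-8):
    """Return the coefficients [1, a_1', ...] of the pruned inverse filter A(z).

    a  : LP coefficients a_1..a_M of A(z) = 1 + sum_i a_i z^-i
    fs : sampling frequency in Hz
    """
    poles = np.roots(np.concatenate(([1.0], np.asarray(a, float))))
    kept = []
    for p in poles:
        if abs(p) > 1.0:
            p = 1.0 / np.conj(p)            # reflect inside the unit circle
        if abs(p.imag) < tol:
            if p.real > 0.0:
                continue                    # real pole at dc: removed
            kept.append(p)                  # real pole at fs/2: retained
            continue
        freq = abs(np.angle(p)) * fs / (2.0 * np.pi)   # formant frequency, Hz
        bw = -np.log(abs(p)) * fs / np.pi              # formant bandwidth, Hz
        if freq >= min_freq and bw <= max_bw:
            kept.append(p)                  # conjugate partner passes the same test
    return np.atleast_1d(np.real(np.poly(kept)))
```

Because the frequency test uses the absolute angle and the bandwidth test uses only the radius, both members of a complex-conjugate pair are always kept or dropped together, so the rebuilt polynomial has real coefficients.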

To obtain the differentiated glottal flow, the speech
signal within the voiced frame is inverse filtered. In order
to remove any low-frequency trend from the glottal flow, any
dc level of the differentiated glottal flow within the frame
is removed. The resulting differentiated glottal flow is
then normalized to have unit power within the frame. For
unvoiced frames, LPC residual inverse filtering is
performed.
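
The per-frame processing of a voiced frame can be sketched in plain Python as below. The function names are hypothetical, and the running-sum integrator is an assumption: the figure captions only state that the glottal flow is obtained by integrating the differentiated flow.

```python
def inverse_filter_frame(s, start, length, a):
    """Differentiated glottal flow for one frame.

    Computes q(n) = s(n) + sum_i a_i s(n-i) for start <= n < start + length,
    removes the dc level, and scales the result to unit power.
    """
    q = []
    for n in range(start, start + length):
        acc = s[n]
        for i, coeff in enumerate(a, start=1):
            if n - i >= 0:                 # samples before the signal start count as zero
                acc += coeff * s[n - i]
        q.append(acc)
    mean = sum(q) / len(q)
    q = [v - mean for v in q]              # dc removal within the frame
    power = sum(v * v for v in q) / len(q)
    if power > 0.0:
        scale = power ** -0.5
        q = [v * scale for v in q]         # unit power within the frame
    return q

def integrate(q):
    """Running sum of q, used here as a simple integrator (assumption)."""
    u, acc = [], 0.0
    for v in q:
        acc += v
        u.append(acc)
    return u
```

Since the dc level is removed before integration, the integrated waveform returns to approximately its starting level at the end of each frame.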

Figure 3-9 shows an inverse filtered glottal flow
waveform from a sustained vowel, /a/, phonated by a normal
subject, DMH. As another example, a sentence (depicted in
Figure  3-6)  from  the  same  subject  was  analyzed  and  results 
are  shown  in  Figure  3-10  and  Figure  3-11.  Note  that  the 
inverse  filtered  glottal  flow  waveforms  from  the  sentence 
contain  unvoiced  or  silent  portions.  These  were  obtained  by 
LP  residual  filtering,  as  can  be  seen  in  the  beginning  and 
the  end  part  of  Figure  3-10  and  Figure  3-11.  Since  we  are 
interested  in  the  voiced  portion  of  the  glottal  waveshape, 
the  unvoiced  or  silent  portions  were  not  considered  further. 
The  inverse  filtered  glottal  waveforms  from  the  sentence 
show  more  cycle-to-cycle  variations  than  do  those  of  the 
sustained  vowel. 

3.2  Glottal  Source  Characteristics 

Some  glottal  waveform  characteristics  of  various  voice 
types  are  summarized  in  terms  of  the  glottal  factors 
important  for  characterizing  several  voice  types  [Lee  and 
Childers,  1989;  Eskenazi  et  al.,  1990]  including:  (1) 
glottal  pulse  width,  (2)  glottal  pulse  skewness,  (3) 
abruptness  of  glottal  closure,  (4)  turbulent  noise,  and  (5) 
glottal  spectral  characteristics.  Some  numerical  data 
measured  from  inverse  filtered  glottal  waveforms  are 
presented in the next chapter.

Figure 3-9. An example of glottal inverse filtering of a
sustained vowel: (a) normalized differentiated glottal
flow waveform; (b) glottal flow waveform obtained by
integrating (a).

Figure 3-10. Examples of glottal inverse filtering of a
sentence: (a) & (c) normalized differentiated glottal
flow waveforms; (b) & (d) glottal flow waveforms
obtained by integrating (a) & (c), respectively.

Figure 3-11. Continued.

3.2.1 Glottal Waveform Characteristics

Normally  a complete  vocal  fold  vibratory  cycle  (during 
voiced  phonation)  consists  of  an  open  phase  and  a closed 
phase. Wide variations in the glottal waveform shape, its
rms (root mean square) intensity and fundamental frequency,
phase spectrum, and intensity spectrum have been reported
[Sondhi, 1975]. The differentiated glottal flow waveforms
obtained  by  inverse  filtering  the  speech  signals  from 
different  voice  types  are  shown  in  Figure  3-12.  We  can  see 
that  the  glottal  flow  waveforms  are  quite  different  for 
different  voice  types. 

The  glottal  flow  waveforms  may  have  negative  values 
depending  on  the  starting  locations  of  the  analysis  frames. 
The  instant  of  opening  is  chosen  as  the  starting  point  for  a 
frame.  Since  the  determination  of  this  instant  from  the 
differentiated  EGG  is  not  very  accurate,  a frame  may  start 
at  the  middle  of  the  opening  phase,  and  the  inverse  filtered 
glottal  flow  waveform  may  be  negative  during  the  closed 
phase. This is most noticeable for the vocal fry voices,
as can be seen in Figure 3-12(c) and (d). Thus care should
be  taken  in  interpreting  the  opening  instants  from  the 
inverse  filtered  differentiated  glottal  flow  waveforms. 

The glottal pulse width is moderate for modal voices
(Figure 3-12(a) and (b)) and small for vocal fry voices
(Figure 3-12(c) and (d)). Breathy voices (Figure 3-12(e)
and (f)) have large pulse widths, often making it appear
that there is no closed phase. For all the voice types
examined (i.e., modal, vocal fry, and breathy voices), the
closing phase exhibits a steeper change of slopes than the
opening phase. Thus the glottal flow waveforms are skewed to
the right. Glottal pulse skewness varies with voice type.
For modal and vocal fry phonations the skewing is more
apparent than for breathy phonations. Most of the modal and
vocal fry voices show very distinct closed phases. The
closed phase is not always apparent for breathy voices, and
in addition the glottal flow waveforms are somewhat
sinusoidal. The glottal closure is relatively steep for
modal and vocal fry voices, but progressive for breathy
voices.

Figure 3-12. Glottal flow and normalized differentiated
glottal flow waveforms for different voice types:
(a) & (b) modal voices, (c) & (d) vocal fry, (e) & (f)
breathy voices.

Due  to  glottal  pulse  skewness,  the  main  excitation 
occurs  at  the  point  of  vocal  fold  closure.  The  magnitude  of 
this  excitation  can  be  controlled  by  the  talker  over  wide 
ranges [Miller, 1959]. In many cases there were also
well-defined instants of excitation of the second and higher
formants  at  other  points  in  the  laryngeal  wave  (typically 
these  occur  at  the  instant  of  opening)  [Holmes,  1962].  For  a 
modal  voice,  the  instant  of  the  maximum  closing  slope  occurs 
near  the  instant  of  glottal  closure,  resulting  in  an  abrupt 
termination  of  the  glottal  airflow.  Vocal  fry  shows 
appreciable  excitation  at  the  start  of  the  open  phase  as 
well  as  at  its  end,  and  there  is  often  an  alternation  in  the 
spectral  content  of  the  excitation  from  cycle  to  cycle, 
causing the relative intensities of the formants to vary
[Lee and Childers, 1989; Hunt, 1987]. For a breathy voice,
the  instant  of  the  maximum  closing  slope  occurs  near  the 
middle  of  the  glottal  closing  phase,  followed  by  a residual 
phase  of  progressive  closure.  A breathy  voice  also  shows 
appreciable  formant  excitation  at  the  middle  of  the  open 
phase.

3.2.2  Turbulent  Noise  Characteristics 

Most  of  the  inverse  filtered  glottal  waveforms  (from 
voiced  phonation)  show  some  high-frequency  "noise" 
superimposed  on  smooth  waveforms.  This  noise  component  is 
called  "turbulent  noise."  When  the  glottis  has  an  imperfect 
closure  and  the  airflow  rate  is  high,  a turbulent  airflow  is 
produced.  During  vocal  fold  vibration  the  sound  pressure  of 
the  glottal  turbulent  noise  fluctuates  due  to  the  variations 
in  airflow  and  glottal  area.  The  sound  pressure  of  the  noise 
is approximately proportional to the square of the volume
velocity of the airflow, and is inversely proportional to
the cross-sectional area of the constriction.

The  amount  of  turbulent  noise  is  relatively  small  for 
modal  and  vocal  fry  voices,  compared  to  the  intensity  of 
harmonics.  For  breathy  voices,  however,  the  high-frequency 
turbulent  noise  in  the  glottal  flow  pulses  is  a noticeable 
feature.  It  is  known  that  the  addition  of  a turbulent  noise 
component  to  the  glottal  flow  generally  increases  the 
naturalness of synthesized speech [Lee and Childers, 1989].

There might also be a ripple component in the open
phase of the glottal flow waveform due to the source-tract
interaction [Fant and Ananthapadmanabha, 1982]. Within the
glottal open phase the formant oscillations can be
terminated by excessive damping. This causes a truncation of
formant amplitudes, changes the formant frequencies, and
increases the formant bandwidths during the glottal open
phase. In synthesizing speech signals, it is believed that
incorporating  the  effects  of  source-tract  interaction 
produces  more  natural  sounding  speech.  In  this  study, 
however,  the  ripple  component  was  not  considered 
independently  from  the  turbulent  noise. 

3.2.3  Glottal  Spectral  Characteristics 

The  auditory  impression  of  speech  signals  is  closely 
related to spectral features [Flanagan, 1972; Holmes, 1973].
The  spectral  features  of  speech  signals  are  largely 
determined  by  the  glottal  spectral  characteristics.  It  is 
known  that  the  glottal  waves  of  various  voice  types  can  be 
distinguished  in  the  spectral  domain  by  two  aspects:  (1)  the 
general  spectral  trend,  and  (2)  the  intensity  relationship 
between  the  fundamental  frequency  and  the  higher  harmonics. 
These  spectral  factors  certainly  affect  the  perceived  voice 
quality,  but  are  of  little  significance  for  speech 
intelligibility.

Figure 3-13 shows examples of differentiated glottal
spectra for different voice types. The glottal spectrum
characteristics show a wide range of variations, which means
that the glottal spectrum can rarely be characterized in
terms of a spectral slope with a constant dB per octave
roll-off [Monsen and Engebretson, 1977]. Generally, the
envelope falls off in an irregular manner. While the glottal
spectra for modal and vocal fry voices (Figure 3-13(a),
(b), (c), and (d)) show moderate spectral slopes, breathy
phonations (Figure 3-13(e) and (f)) show a relatively steep
spectral slope. The sensation of vocal effort, i.e.,
hypo-functional or lax vocal quality, is closely related to
the steeper spectral trend, while a source excitation with
considerable energy at high frequencies results in
hyper-functional or tense vocal quality.

Figure 3-13. Differentiated glottal flow spectra for
different voice types: (a) & (b) modal voices, (c) & (d)
vocal fry, (e) & (f) breathy voices.

Due  to  the  high  energy  content,  the  harmonics  at  low 
frequencies  (below  the  first  formant)  are  important  for 
perception [Holmes, 1973]. The glottal spectra of different
voice  types  show  distinctive  intensity  relations  between  the 
fundamental  and  the  higher  harmonics.  The  vocal  fry  glottal 
waves  show  relatively  strong  harmonics,  while  the  breathy 
glottal waves are characterized by a high-intensity

CHAPTER 4

VOICE SOURCE MODELING FOR DIFFERENT VOICE TYPES

In this chapter, glottal source models that can be used
to represent the glottal flow waveforms will be reviewed.
Then  a procedure  for  estimating  parameters  of  a source  model 
will  be  discussed.  The  results  of  a statistical  analysis  of 
the  data  from  human  subjects  will  be  reported.  This  data  is 
used  to  develop  a source  model  for  approximating  different 
voice  types.  Applications  of  the  analysis  results  to 
synthesizing  various  voice  types  will  also  be  discussed. 

4.1  Voice  Source  Models 

Based  on  the  linear  speech  production  model,  source 
models  have  been  used  to  represent  the  glottal  flow  pulse 
waveforms. A source model can also be used to describe the
statistics of the glottal source flow characteristics. The
glottal flow waveform has two components: (1) a residue
component and (2) a "noise" component. The residue
component represents the main pulse of the glottal flow and
can be described by a smooth function with only a few
parameters. The noise component represents the remainder of
the glottal flow; it includes the turbulent noise and the
ripple component. While turbulent noise is mainly generated
at a constriction in the vocal tract, the energy stored in
the vocal tract contributes to the ripple component. A model
of the source should be flexible so that it may represent
various types of glottal flow waveforms. A comparative study
of some source models has been given by Fujisaki and
Ljungqvist (1986). Here two typical glottal waveform models
are discussed: Fant's model [Fant, 1979] and the LF model
[Fant et al., 1985].

4.1.1 Fant's model

The wave shape of Fant's model [Fant, 1979] consists
of rising and falling segments, as shown in Figure 4-1, and
is represented by two equations:

$$ U_g(t) = \frac{U_0}{2}\left[1 - \cos\omega_g t\right], \qquad 0 \le t \le t_p \tag{4-1} $$

$$ U_g(t) = U_0\left[K\cos\omega_g(t - t_p) - K + 1\right], \qquad t_p \le t \le t_c \le T_0 \tag{4-2} $$

where tc is the termination of the waveform. Given the
fundamental frequency F0 = 1/T0, where T0 is a pitch period,
three basic parameters of Fant's model are the peak flow U0,
the glottal frequency Fg = ωg/2π, and the asymmetry factor
(steepness factor) K. The rising segment reaches a peak
value of U0 at t = tp and the falling segment has a value of
zero at t = tc. By applying these conditions to equations
(4-1) and (4-2), respectively, the relationships between the
time parameters and the waveform parameters can be obtained
as follows:

$$ t_p = \frac{\pi}{\omega_g} = \frac{1}{2F_g} \tag{4-3} $$

$$ t_c = t_p + \frac{1}{\omega_g}\arccos\left(\frac{K - 1}{K}\right) \tag{4-4} $$

or

$$ K = \frac{1}{1 - \cos\omega_g(t_c - t_p)} $$

Figure 4-1. Fant's model of glottal wave.


The  termination  of  the  waveform  at  t = tc  determines 
the  primary  excitation  of  the  glottal  wave.  By 
differentiating  equation  (4-2)  and  applying  the 
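relationships of equation (4-4) and of sin(·) = (1 -
cos²(·))^(1/2), under the condition of K > 0.5, the
termination slope is given as

$$ \left.\frac{dU_g(t)}{dt}\right|_{t = t_c} = -U_0\,\omega_g\sqrt{2K - 1} = -\frac{U_0}{T_d} \tag{4-5} $$

where

$$ T_d = \frac{1}{\omega_g\sqrt{2K - 1}} \tag{4-6} $$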

If  K > 1 this  gives  the  maximum  slope  during  the  course  of 
the  falling  component.  For  0.5  < K < 1 the  maximum  slope 
occurs  prior  to  closure.  At  K = 0.5  the  falling  component 
is symmetrical to the rising branch. The lower bound of K
is 0.5 in the present use of the model and represents the
lowest degree of excitation strength. Generally the obvious
shortcoming of a model with abrupt flow termination is that
it does not allow for an incomplete closure or for a
residual phase of progressive closure after the major
discontinuity.
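
As an illustration of equations (4-1) through (4-5), the following sketch evaluates Fant's model for a given peak flow, glottal frequency, and asymmetry factor. The function names and the example values are hypothetical and were chosen only for this illustration; the timing relations follow from the conditions stated above (peak flow at tp, zero flow at tc).

```python
import math


def fant_timing(fg, k):
    """Return (tp, tc) in seconds for glottal frequency fg [Hz] and factor k."""
    if k < 0.5:
        raise ValueError("k must be at least 0.5")
    wg = 2.0 * math.pi * fg
    tp = math.pi / wg                          # peak of the rising segment
    tc = tp + math.acos((k - 1.0) / k) / wg    # zero of the falling segment
    return tp, tc


def fant_flow(t, u0, fg, k):
    """Glottal flow Ug(t) of Fant's model; zero outside the open phase."""
    wg = 2.0 * math.pi * fg
    tp, tc = fant_timing(fg, k)
    if 0.0 <= t <= tp:
        return 0.5 * u0 * (1.0 - math.cos(wg * t))              # equation (4-1)
    if tp < t <= tc:
        return u0 * (k * math.cos(wg * (t - tp)) - k + 1.0)     # equation (4-2)
    return 0.0


def termination_slope(u0, fg, k):
    """Slope of the flow at t = tc, equation (4-5)."""
    wg = 2.0 * math.pi * fg
    return -u0 * wg * math.sqrt(2.0 * k - 1.0)
```

For example, with Fg = 125 Hz and K = 1 the rising segment lasts 4 ms and the falling segment 2 ms, while K = 0.5 gives a symmetrical pulse with a termination slope of zero.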

4.1.2  LF-model 

Fant  et  al.  (1985)  proposed  the  LF  model  as  shown  in 
Figure  4-2.  This  model  describes  the  differentiated  glottal 
flow  rather  than  the  glottal  flow  itself.  The 
differentiated  flow  is  commonly  used  in  speech  synthesis, 
and  includes  the  effect  of  radiation  at  the  lips.  The  LF 
model  consists  of  two  segments.  The  first  segment  is  an 
exponentially  growing  sinusoid,  and  the  second  one  is  an 
exponentially decaying function. Each segment is expressed
as follows:

$$ \frac{dU_g(t)}{dt} = E(t) = E_0\,e^{\alpha t}\sin\omega_g t, \qquad 0 \le t \le t_e \tag{4-7} $$

$$ E(t) = -\frac{E_e}{\epsilon t_a}\left[e^{-\epsilon(t - t_e)} - e^{-\epsilon(t_c - t_e)}\right], \qquad t_e \le t \le t_c \le T_0 \tag{4-8} $$

where T0 is the pitch-period interval within which a
waveshape of the LF model is defined. At time te both
segments have the same value, -Ee. Besides the above
relationships, there is a requirement of area balance which
keeps the zero flow line from drifting. Thus the integral of
the LF model time-function through the glottal period should
vanish, i.e.,

$$ \int_0^{T_0} E(t)\,dt = 0 \tag{4-9} $$

Figure 4-2. The LF-model of differentiated glottal flow
Ug'(t) (not drawn to scale).


The  three  parameters  of  the  first  segment  of  the 
LF-model  are: 

(1) E0, which is a scale factor.

(2) α = -Bπ, where B is the "negative bandwidth" of the
exponentially growing amplitude.

(3) ωg = 2πFg, where Fg = 1/(2tp) and tp is the rise time (the
time from glottal opening to maximum flow).

In  the  second  part  of  the  LF  model,  the  parameter  ta  is  the 
time  constant  of  the  exponential  curve  and  is  determined  by 
the  projection  on  the  time  axis  of  the  derivative  at  time 
te,  at  which  the  negative  peak  of  the  LF  model  occurs.  The 
parameter  Ee  is  the  negative  amplitude  of  the  excitation 
spike  at  time  te.  The  parameter  tc  is  the  moment  when 
complete closure is reached. The parameter ε is the decay
constant of the recovery phase exponential. The basic four
parameters E0, α, ωg, and ε are called the "direct synthesis
parameters" of the LF model, while the time parameters tp,
te, ta, and tc are called the "timing parameters". The LF
model timing/direct-synthesis parameters can be thought of
as independent of one another, because a unique combination
of timing/direct-synthesis parameters generates a unique
waveshape.

The  first  segment  of  the  LF  model  represents  the 
differentiated  flow  from  glottal  opening  to  the  instant  when 
the  main  excitation  occurs  (the  moment  of  maximum 
discontinuity  in  the  glottal  airflow  function,  which 
normally  coincides  with  the  moment  of  the  maximum  negative 
flow derivative). The second part of the model is an
exponential segment that allows a residual flow (dynamic
leakage), from the point of maximum closing discontinuity at
time te towards maximum closure, when the vocal folds close
at time tc. The effect of the return phase on the source
spectrum is, due to its exponential waveshape, approximately
a first order low-pass filter with a cutoff frequency Fa =
1/(2πta) [Fant and Lin, 1988]. This means that the longer
the  return  phase,  the  lower  the  cutoff  frequency,  and  the 
greater  the  reduction  of  the  high  frequency  energy. 
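
The relation between the return-phase time constant and the cutoff frequency can be checked with a short calculation. The function name and the ta values mentioned below are hypothetical illustrations, not measurements from this study.

```python
import math


def return_phase_cutoff(ta):
    """Cutoff frequency Fa [Hz] for a return-phase time constant ta [s]."""
    if ta <= 0.0:
        raise ValueError("ta must be positive")
    return 1.0 / (2.0 * math.pi * ta)
```

Under this relation a time constant of 0.1 ms corresponds to a cutoff near 1.6 kHz, while 1 ms lowers the cutoff to about 159 Hz, which is the sense in which a longer return phase removes more high-frequency energy.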

The  LF  model  time-function  is  generated  by  using  the 
direct synthesis parameters, i.e., E0, α, ωg, and ε. However,
for many research applications, such as model fitting to
inverse filtered glottal flow waveforms, it is easier to
specify the timing parameters (tp, te, ta, tc) and Ee rather
than the direct synthesis parameters. The direct synthesis
parameters can be easily computed from the timing parameters
and Ee. The procedure to obtain the corresponding direct
synthesis parameters from the timing parameters and Ee is as
follows:

(1) The intermediate parameter ε can be determined by an
iterative procedure from equation (4-8) by letting t = te,
i.e., from

$$ \epsilon t_a = 1 - e^{-\epsilon(t_c - t_e)} \tag{4-10} $$

For small values of ta, ε is approximately equal to 1/ta.

(2) By definition, ωg = π/tp.

(3) The solution for the parameter α can be obtained by
applying the area balance constraint of equation (4-9) with
the solution of E0 from equation (4-7)

$$ E_0 = -\frac{E_e}{e^{\alpha t_e}\sin\omega_g t_e} \tag{4-11} $$


In  estimating  the  LF  model  parameters,  the  parameter 
tc,  which  represents  the  closing  instant,  is  usually  set  to 
T0,  the  time  of  glottal  opening  for  the  following  pulse 
period.  This  implies  that  the  model  may  lack  a closed 
phase.  In  practice,  however,  for  small  values  of  ta  the 
exponential  function  of  the  second  part  of  the  LF  model  will 
have negligible value and thus provides an effective closed
phase.
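
The conversion from timing parameters to direct synthesis parameters can be sketched as follows. This is one possible realization under stated assumptions, not the procedure used in this study: ε is found by fixed-point iteration on equation (4-10), α by bisection on the area balance of equation (4-9), and the search bracket for α (plus or minus 50/te) is an arbitrary choice that assumes a single sign change. All function names are hypothetical.

```python
import math


def solve_epsilon(ta, te, tc, iterations=200):
    """Fixed-point iteration on eps*ta = 1 - exp(-eps*(tc - te)), eq. (4-10)."""
    eps = 1.0 / ta                      # starting value, exact as ta -> 0
    for _ in range(iterations):
        eps = (1.0 - math.exp(-eps * (tc - te))) / ta
    return eps


def return_area(ee, eps, ta, te, tc):
    """Integral of the return-phase segment (4-8) from te to tc."""
    d = tc - te
    decay = math.exp(-eps * d)
    return -(ee / (eps * ta)) * ((1.0 - decay) / eps - d * decay)


def open_area(ee, alpha, wg, te):
    """Integral of segment (4-7) from 0 to te, with E0 taken from (4-11)."""
    s, c = math.sin(wg * te), math.cos(wg * te)
    num = alpha * s - wg * c + wg * math.exp(-alpha * te)
    return -ee * num / ((alpha * alpha + wg * wg) * s)


def lf_direct_parameters(tp, te, ta, tc, ee):
    """Convert (tp, te, ta, tc, Ee) to (E0, alpha, wg, eps)."""
    if not (0.0 < tp < te < 2.0 * tp and te < tc and 0.0 < ta < tc - te):
        raise ValueError("timing parameters outside the range handled here")
    eps = solve_epsilon(ta, te, tc)
    wg = math.pi / tp
    tail = return_area(ee, eps, ta, te, tc)
    balance = lambda a: open_area(ee, a, wg, te) + tail   # zero at area balance
    lo, hi = -50.0 / te, 50.0 / te      # assumed search bracket for alpha
    if balance(lo) <= 0.0 or balance(hi) >= 0.0:
        raise ValueError("area balance has no sign change inside the bracket")
    for _ in range(200):                # bisection
        mid = 0.5 * (lo + hi)
        if balance(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    e0 = -ee / (math.exp(alpha * te) * math.sin(wg * te))  # equation (4-11)
    return e0, alpha, wg, eps


def lf_sample(t, tp, te, ta, tc, ee, params):
    """Differentiated glottal flow E(t) of the LF model at time t."""
    e0, alpha, wg, eps = params
    if 0.0 <= t <= te:
        return e0 * math.exp(alpha * t) * math.sin(wg * t)
    if te < t <= tc:
        scale = ee / (eps * ta)
        return -scale * (math.exp(-eps * (t - te)) - math.exp(-eps * (tc - te)))
    return 0.0
```

Sampling `lf_sample` over one pitch period with tc set to T0 gives a waveform whose two segments meet at -Ee and whose integral over the period is zero to within numerical error.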

The  LF-model  function  is  continuous  until  the  main 
excitation,  and  therefore  does  not  introduce  additional 
excitation at the flow peak. In comparison, Fant's model
consists of two different segments: a rising segment up to
maximum  flow  and  a falling  segment  down  to  complete  closure. 
The  discontinuity  between  the  two  segments  introduces  a 
secondary  weak  excitation  at  the  flow  peak.  The  major 
difference  between  these  two  models  is  that  the  LF  model 
allows  for  a residual  phase  of  progressive  closure,  while 
the  Fant  model  always  generates  an  abrupt  closure.  The 
existence  of  the  residual  closing  phase  in  the  LF  model 
gives the flexibility of modeling various voice types more
efficiently.

In  summary  the  LF  model  is  a good  approximation  for 
non-interactive  flow  parameterization  in  the  sense  that  it 
ensures  an  overall  fit  to  commonly  encountered  glottal  flow 
wave  shapes  with  a minimum  number  of  parameters  [Fant  et 
al.,  1985].  It  is  flexible  in  its  ability  to  match  various 
phonations,  for  example,  breathy  voice.  The  four  glottal 
factors  important  for  characterizing  several  voice  types 
[Lee  and  Childers,  1989]  are:  (1)  glottal  pulse  width,  (2) 
glottal  pulse  skewness,  (3)  abruptness  of  glottal  closure, 
and  (4)  turbulent  noise.  The  first  three  glottal  factors  can 
be  modeled  effectively  with  the  LF  model.  The  fourth  factor 
should  be  incorporated  separately  to  provide  a complete 
model.  In  this  research  the  LF  model,  with  an  accommodation 
for  turbulent  noise,  was  used  to  parameterize  the  waveform 
characteristics of the glottal flow.

4.2 Modeling of the Glottal Flow Waveform

For  voiced  sounds,  the  closed  phase  vocal  tract 
parameters  were  calculated  from  the  closed  phase  covariance 
LPC  analysis.  Then  the  glottal  flow  waveform  was  obtained 
by  inverse  filtering.  A source  model  was  fitted  to  the 
glottal  flow  waveform  to  obtain  the  glottal  source 
parameters.  The  glottal  opening  instant  and  the  duration  of 
the  glottal  open  phase  as  well  as  the  closing  instant  and 
the  closed  phase  duration  were  determined  from  the 
differentiated  EGG  signals. 

4.2.1  Measurement  of  Model  Parameters 

We  used  the  LF  model  to  extract  the  parametric 
features  of  the  glottal  flow  waveform.  When  fitting  the  LF 
model  to  inverse  filtered  differentiated  glottal  flow 
waveforms,  the  parameter  tc,  which  represents  the  closing 
instant,  is  assumed  to  be  equal  to  T0,  the  time  of  glottal 
opening for the next pulse period. In this study, we defined
the parameter tc as the instant at which the modeled
differentiated glottal flow amplitude drops to 1% of its
peak value; it was computed from the matched model waveform.
Thus, the parameter tc does not represent the actual closing
instant of the LF model; rather, it can be considered as a
settling time of the model. We used the parameter tc as an
approximation to the closing instant of the LF model.

Figure  4-3  shows  the  block  diagram  for  the  algorithm 
that  matches  the  LF  model  to  the  measured  differentiated 
inverse  filtered  waveform.  Within  an  analysis  frame 
(corresponding  to  one  pitch  period)  of  the  inverse  filtered 
differentiated  glottal  flow  waveforms,  the  values  of  te  and 
Ee  parameters  are  easily  measured.  The  parameter  ta  is 
determined by a least squared error (LSE) criterion between
the inverse filtered differentiated glottal flow waveform
and the closing part of the LF model given by equation
(4-8). Then possible candidates for the parameter tp are
located in the range from the instant of the glottal opening
to the instant te. For each candidate of tp, the direct
synthesis parameters (E0, α, ωg, and ε) are calculated.
Based  on  the  direct  synthesis  parameters  obtained,  the 
modeled  differentiated  glottal  flow  waveforms  are  generated 
and  the  total  squared  errors  are  computed.  Finally  the 
parameter  set  which  gives  the  minimum  total  squared  error  is 
selected  as  the  best  matching  LF  model  for  the  frame. 
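
The final selection step of the matching algorithm, an exhaustive search over candidates that keeps the one with the minimum total squared error, can be sketched as follows. The sketch is deliberately generic: `best_candidate` and `total_squared_error` are hypothetical names, and `toy_model` is a stand-in waveform generator used only to make the example runnable. It is not the LF model and does not reproduce the block diagram of Figure 4-3.

```python
import math


def total_squared_error(target, model):
    """Sum of squared sample differences between two equal-length frames."""
    if len(target) != len(model):
        raise ValueError("frames must have the same length")
    return sum((x - y) ** 2 for x, y in zip(target, model))


def best_candidate(target, candidates, synthesize):
    """Return (candidate, error) with the minimum total squared error.

    `synthesize(candidate, n)` must return n samples of the modeled waveform.
    """
    best, best_err = None, float("inf")
    for cand in candidates:
        err = total_squared_error(target, synthesize(cand, len(target)))
        if err < best_err:
            best, best_err = cand, err
    return best, best_err


def toy_model(tp_index, n):
    """Stand-in waveform (NOT the LF model): zero crossing at sample tp_index."""
    return [math.sin(math.pi * i / tp_index) for i in range(n)]
```

In an actual fit the `synthesize` argument would generate the LF waveform for each tp candidate from the measured te, Ee, and ta; the cost of the search grows linearly with the number of candidates.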

Examples  of  the  inverse  filtered  glottal  flow  and 
modeled  glottal  flow  waveforms  of  a sustained  vowel,  /a/, 
and  voiced  portions  of  a sentence,  "Should  we  chase  those 
cowboys?"  are  shown  in  Figure  4-4  and  Figure  4-5, 
respectively.  It  is  not  easy  to  estimate  accurately  the 
opening  instants  for  the  modeled  differentiated  glottal  flow 
from  the  differentiated  EGG.  Thus,  a less  than  perfect  match 
between  the  real  and  modeled  differentiated  glottal  flow  may 
result. A possible solution is to adjust the model opening
instant until an optimum, in the sense of the minimum total
squared error, is found. This procedure, however, requires a
considerable amount of computing time.

Figure 4-3. The block diagram for the LF model matching
algorithm.

Figure 4-4. LF-modeled data for the vowel /a/ (DMH): (a)
normalized differentiated glottal flow, (b) glottal flow.

Figure 4-5. LF-modeled data for voiced portions of the
sentence, "Should we chase those cowboys?" (DMH): (a) &
(c) inverse filtered glottal flow, (b) & (d) modeled
glottal flow.

The  timing  parameters  of  the  LF  model  are  closely 
related  to  the  glottal  waveshape  factors,  for  example,  tc  to 
glottal pulse width, ta to abruptness of glottal closure,
and te to the instant of the main excitation during the
glottal open phase.
To  describe  quantitatively  the  waveshape  characteristics  of 
the  modeled  glottal  flow,  we  defined  the  open  quotient  (OQ) 
and  the  speed  quotient  (SQ)  for  the  LF  model  waveshape.  The 
open quotient (OQlf) of the LF model waveshape is defined as
the ratio of the open phase to the pitch period, i.e.,

$$ OQ_{lf} = \frac{\text{open phase}}{\text{pitch period}} = \frac{t_c}{T_0} \tag{4-12} $$

The range of values for the open quotient is from 0 (no open
phase) to 1 (no closed phase). The speed quotient (SQlf) of
the LF model waveshape is defined as:

$$ SQ_{lf} = \frac{\text{opening phase}}{\text{closing phase}} \tag{4-13} $$


Glottal pulse skewness is commonly represented by the speed
quotient. Values of the speed quotient can range from 0 (no
opening phase) to infinity (no closing phase). Neither
extreme value can occur, owing to the physiological
limitations of human articulatory movements.

Note that in equations (4-12) and (4-13), the computed tc is
used  to  approximate  the  closing  instant.  The  measured  data 
for  both  the  OQ  and  the  SQ  from  the  matched  LF  model 
waveforms  are  reported  in  a later  section. 
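
The two quotients and the 1% settling time can be computed from the timing parameters as in the sketch below. Two points are assumptions of this illustration rather than statements from the text: the opening phase is taken as tp and the closing phase as tc - tp, and the "peak value" in the 1% criterion is taken as Ee. The function names are hypothetical.

```python
import math


def settling_time(te, ta, eps, t0, fraction=0.01):
    """Instant at which the return phase of (4-8) falls to fraction * Ee.

    Assumes the model closing instant is T0 and that the peak value is Ee.
    """
    floor = math.exp(-eps * (t0 - te))
    return te - math.log(fraction * eps * ta + floor) / eps


def open_quotient(tc, t0):
    """Open quotient, equation (4-12)."""
    return tc / t0


def speed_quotient(tp, tc):
    """Speed quotient (4-13), assuming opening = tp and closing = tc - tp."""
    if tc <= tp:
        raise ValueError("tc must be later than tp")
    return tp / (tc - tp)
```

With the illustrative values tp = 4 ms, te = 5 ms, ta = 0.3 ms, T0 = 10 ms, and ε approximated by 1/ta, the settling time is close to 6.38 ms, giving an open quotient of roughly 0.64.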

4.2.2  Estimation  of  Turbulent  Noise 

It  is  generally  known  that  the  incorporation  of  the 
turbulent  noise  in  the  glottal  source  model  will  enhance  the 
naturalness of synthesized speech [Klatt, 1987; Lee,
1988]. Most of the conventional glottal source models,
including  the  LF  model,  however,  do  not  have  a provision  to 
incorporate  the  turbulent  noise  component,  because  of  the 
difficulties  in  estimating  and  modeling  the  noise  component 
from  glottal  flow  waveforms  or  from  the  speech  signal.  One 
may  use  the  harmonic-to-noise  ratio  (HNR)  to  measure  the 
signal-to-noise  ratio  (SNR)  and  thereby  estimate  the  noise 
component.  In  this  study  we  adopted  a frequency  intensity 
ratio  [Frokjaer-Jensen  and  Prytz,  1976]  as  a means  of 
approximating  the  turbulent  noise  component  of  the 
differentiated  glottal  flow. 

Several  methods  for  estimating  the  HNR  from  the  speech 
signal  have  been  proposed  in  both  the  time-domain  and  the 
frequency-domain  [Yumoto  et  al.,  1982;  Hiraoka  et  al., 
1984]. Although the HNR has been used successfully for
characterizing pathological voices, especially hoarse
voices, the conventional methods to estimate the HNR from
the speech signal have a limitation: any measurement method
for HNR cannot avoid influences from the nearby formants,
which may have relatively large energy near harmonics of the
fundamental frequency. Lee [1988] reported HNR values
estimated for different voice types. He used a
frequency window, which was controlled by the fundamental
frequency, to pick out harmonic components from the speech
signal.

One  may  also  use  a modeled  signal-to-noise  ratio  (SNR) 
to  estimate  the  noise  component.  The  modeled  SNR  can  be 
defined as:

$$ \text{Modeled SNR} = \frac{\text{mdg power}}{\text{dg power} - \text{mdg power}} \tag{4-14} $$

where  dg  and  mdg  represent  the  inverse  filtered 
differentiated  glottal  flow  and  the  modeled  differentiated 
glottal  flow,  respectively.  The  modeled  SNR  represents  the 
relative  intensity  of  the  model  function  to  the  intensity  of 
the  noise  component  in  the  true  glottal  flow.  The  drawback 
of  the  modeled  SNR  is  that  it  can  be  used  only  for  a 
specific  glottal  model.  Moreover,  it  is  very  sensitive  to 
the time misalignment of the model waveform.
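
Equation (4-14) can be evaluated directly from the two waveforms, as in the sketch below. The function names are hypothetical and power is assumed to be the mean-square value of a frame. Note that the denominator is a difference of powers, as written in (4-14), not the power of the sample-by-sample difference.

```python
import math


def mean_power(x):
    """Mean-square value of a frame (assumed definition of power)."""
    if not x:
        raise ValueError("frame must contain at least one sample")
    return sum(v * v for v in x) / len(x)


def modeled_snr(dg, mdg):
    """Modeled SNR of equation (4-14) as a linear power ratio."""
    noise = mean_power(dg) - mean_power(mdg)
    if noise <= 0.0:
        raise ValueError("dg power must exceed mdg power")
    return mean_power(mdg) / noise


def to_db(ratio):
    """Convert a power ratio to decibels."""
    return 10.0 * math.log10(ratio)
```

Because the measure depends only on the two powers and on how well the model is aligned, a small time shift of the model waveform changes the fitted mdg, and hence its power, which is one way to see the sensitivity mentioned above.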

Frokjaer-Jensen and Prytz (1976) defined a parameter,
α, to measure the intensity ratio between the lower and
higher frequency regions of the speech signal. They reported
that the parameter α was a good acoustic correlate of the
physiological term "vocal fold medial compression," i.e., a
voice with low vocal effort has a low α-value, while a
voice with high vocal effort has a high α-value. The major
drawback of this parameter is that its value is greatly
affected by the specific formant structure of the particular
vowel sample. To overcome this problem they computed
α-values for long-time-average spectra (LTAS), though this
approach requires considerable computation.

In this study we computed the α-value of the inverse
filtered differentiated glottal waveform, i.e.,

$$ \alpha = \frac{\text{intensity above 1 kHz}}{\text{intensity below 1 kHz}} \tag{4-15} $$

Since the α-value, called the frequency intensity ratio, is
computed from the inverse filtered differentiated glottal
waveform, the influence of the formants can be avoided
completely, provided that inverse filtering is accurately
performed. In the strict sense, the α-value does not
estimate  the  turbulent  noise  in  the  glottal  source.  It  can, 
however,  be  used  as  an  approximation  for  the  turbulent  noise 
component  because  in  the  glottal  flow  waveforms  the 
signal-to-noise  ratio  (SNR)  in  the  low-frequency  range  is 
much  higher  than  in  the  high-frequency  range  due  to  the 
strong  first  harmonic  (usually  less  than  1 kHz)  of  the 
fundamental  frequency. 
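
A frequency intensity ratio of the form (4-15) can be computed from a discrete Fourier transform of the frame, as sketched below. Several choices here are assumptions of the illustration: intensity is taken as spectral power, the dc bin is excluded, the bin at exactly 1 kHz is assigned to the upper band, and a direct (slow) DFT is used so that the example needs only the standard library. The function name is hypothetical.

```python
import cmath
import math


def frequency_intensity_ratio_db(x, fs, split_hz=1000.0):
    """Ratio of power above split_hz to power below it, in dB."""
    n = len(x)
    low = high = 0.0
    for k in range(1, n // 2 + 1):                   # positive bins, dc skipped
        coeff = sum(x[i] * cmath.exp(-2j * math.pi * k * i / n)
                    for i in range(n))
        power = abs(coeff) ** 2
        if n % 2 == 0 and k == n // 2:
            power *= 0.5                             # Nyquist bin has no mirror
        if k * fs / n < split_hz:
            low += power
        else:
            high += power
    if low == 0.0 or high == 0.0:
        raise ValueError("one of the bands contains no energy")
    return 10.0 * math.log10(high / low)
```

For a test signal with a unit-amplitude component at 500 Hz and a component of amplitude 0.1 at 2 kHz, the ratio is -20 dB. For long frames an FFT routine would normally replace the direct sum.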

Table 4-1 summarizes the mean α-values in dB and their
standard deviations (std) for the different voice types
analyzed in this study. As can be noted from Table 4-1,
breathy voices have a relatively strong low-frequency
component in their glottal flow waveforms as compared to
modal and vocal fry voices. This is because the glottal flow
waveforms of breathy voices are relatively sinusoidal and
the low harmonics, especially the first one or two, are
stronger than others. The ranking of voice types according
to decreasing mean α-value was vocal fry, modal, and
breathy. From the mean α-values it can be concluded
that a breathy voice requires the least vocal effort while
the vocal fry type voice requires the most effort. Moreover,
although the high-frequency turbulent noise in the glottal
flow waveforms is a good feature for the breathy voice
[Holmes, 1973; Lee and Childers, 1989], its relative
intensity is low due to the high-intensity fundamental
component.


Table 4-1. Mean a-values and standard deviations (std)
for different type voices

Phonation type              mean a-value [dB]
(number of measurements)    (std)

modal (1294)                -11.11 (4.42)
vocal fry (769)              -8.99 (4.91)
breathy (1323)              -19.62 (10.13)

4.2.3 Estimation of Spectral Tilt


The  general  spectral  tilt  or  slope  for  a voiced 
phonation  is  determined  by  the  combined  contribution  of  the 
spectrum  of  the  glottal  pulse  and  the  lip  radiation.  The 
general  spectral  tilt  of  a glottal  flow  can  be  represented 
by  a transfer  function  of  a low-pass  filter  with  multiple 
real  poles.  While  the  glottal  spectral  characteristics  for 
modal  and  vocal  fry  voices  can  be  modeled  well  by  a two-pole 
model  (-12  dB/octave),  an  extra  pole  is  usually  required  for 
breathy phonations [Lee and Childers, 1989; Klatt and Klatt,
1990]. The extra pole yields a steeper spectral slope of -18
dB/octave for the three-pole model. We adopted a three-pole
model  to  estimate  the  general  spectral  tilt  of  a glottal 
volume  flow: 


$$ U_g(z) = \frac{K}{(1 - z_a z^{-1})(1 - z_b z^{-1})(1 - z_c z^{-1})} \tag{4-16} $$


where  K is  a constant  related  to  the  amplitude  of  the 
glottal  flow  and  za,  zb,  and  zc  are  real  poles  inside  the 
unit  circle  in  the  z-domain.  Here  we  represent  the  spectral 
tilt  of  the  voiced  phonation  with  these  pole  values. 

Since  we  deal  with  the  differentiated  glottal  flow 
waveforms,  it  is  implicitly  assumed  that  the  value  of  za  is 
equal  to  1,  or  equivalently,  the  input  glottal  wave  is  first 
pre-emphasized  by  a high-pass  filter  of  the  form: 

$$ 1 - z^{-1} \tag{4-17} $$


The  effect  of  this  procedure  is  also  equivalent  to  including 
lip  radiation  into  the  glottal  shaping  filter  of  the  speech 
production  model.  To  estimate  the  values  of  the  other  real 
poles  of  equation  (4-16),  an  ordinary  linear  prediction  (LP) 
analysis  method  cannot  be  directly  applied  because  the  poles 
are  restricted  to  have  real  values  only.  We  used  an  adaptive 
pre-emphasis  method  that  carries  out  a series  of  first-order 
LP  analyses  on  the  differentiated  glottal  flow  waveforms 
[Lee, 1988].
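
One plausible reading of "a series of first-order LP analyses" is sketched below. The exact procedure of [Lee, 1988] is not given in this text, so this is an assumption: each stage estimates a single real pole as the ratio r(1)/r(0) of autocorrelation values (the first-order autocorrelation-method LP solution), removes it with the inverse filter 1 - z·z^{-1}, and repeats on the residual. The function names and the synthetic one-pole test signal are hypothetical.

```python
import random


def first_order_lp_coefficient(x):
    """First-order LP coefficient by the autocorrelation method: r(1)/r(0)."""
    r0 = sum(v * v for v in x)
    r1 = sum(x[i] * x[i - 1] for i in range(1, len(x)))
    return r1 / r0 if r0 > 0.0 else 0.0


def estimate_real_poles(x, n_poles=2):
    """Estimate real poles one at a time (assumed adaptive pre-emphasis).

    After each estimate z, the signal is filtered by (1 - z * z^-1) so that
    the next first-order analysis sees the remaining spectral tilt.
    """
    poles = []
    residual = list(x)
    for _ in range(n_poles):
        z = first_order_lp_coefficient(residual)
        poles.append(z)
        residual = [residual[0]] + [residual[i] - z * residual[i - 1]
                                    for i in range(1, len(residual))]
    return poles


# Hypothetical check: white noise shaped by a single real pole at 0.9.
random.seed(0)
signal = []
previous = 0.0
for _ in range(20000):
    previous = 0.9 * previous + random.gauss(0.0, 1.0)
    signal.append(previous)
poles = estimate_real_poles(signal, n_poles=2)
```

On this one-pole signal the first estimate should lie close to 0.9 and the second close to zero, since the residual is nearly white. For signals with several real poles the staged estimates are only approximations to the true pole locations.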

The  glottal  spectral  model  generally  approximates  the 
high-frequency  spectral  trend  better  than  the  low-frequency 
characteristics  of  the  glottal  pulse.  Thus,  if  an  all-pole 
model  is  used  for  the  glottal  flow,  it  would  be  more  likely 
to  cause  a mismatch  in  the  low-frequency  region.  In  this 
study  the  all-pole  model  was  not  used  as  a glottal  flow 
model,  rather  it  was  used  to  estimate  the  spectral  tilt  of 
the  glottal  flow  waveform.  The  estimated  spectral  tilt  was 
then  used  to  adapt  the  LF  model,  which  was  matched  in  the 
time-domain,  based  on  the  estimation  of  the  spectral  tilt  of 
the  inverse  filtered  differentiated  glottal  flow  waveform. 

Table  4-2  shows  the  coefficients  of  the  real-pole 
glottal  model  estimated  for  different  type  voices  from  both 
the inverse filtered and the LF modeled differentiated
glottal flow waveforms. The ranking of voice type according
to increasing spectral tilt was vocal fry, modal, and
breathy. This ordering is the same as that for the
high-to-low frequency intensity ratios (or the a-values).
Thus,  the  larger  the  spectral  tilt,  the  steeper  the  spectral 
slope  and  the  less  the  vocal  effort.  Our  data  also  show  that 
the  spectral  tilt  estimated  from  the  LF  modeled 
differentiated  glottal  flow  waveform  is,  on  the  average, 
larger  than  that  from  the  inverse  filtered  waveform. 


Table 4-2. Mean values of spectral tilts and standard
deviations (std) estimated for different type voices from
both the inverse filtered differentiated glottal (DG) flow
and the LF modeled differentiated glottal (MDG) flow
waveforms.

Phonation type              from DG                         from MDG
(number of measurements)    zb (std)        zc (std)        zb (std)        zc (std)

modal (1294)                0.884 (0.207)   0.070 (0.195)   0.959 (0.049)   0.372 (0.338)
vocal fry (769)             0.797 (0.269)   0.047 (0.160)   0.941 (0.064)   0.198 (0.294)
breathy (1323)              0.887 (0.299)   0.396 (0.339)   0.978 (0.042)   0.690 (0.241)

4.2.4 Model Adaptation based on Spectral Tilt

According to the statistical results of the estimated
spectral tilt of both the inverse filtered and the LF
modeled differentiated glottal flow waveforms, the modeled
waveform, estimated in the time-domain, has on the average a
steeper spectral tilt than the inverse filtered waveform
(Table 4-2); that is, it underestimates the high-frequency
content. This is attributed to the lack of high-frequency
components in the source model used. Thus, it would be
better to adapt the LF model estimated in the time-domain in
order to compensate for this spectral tilt mismatch. Our
approach is to adjust the return phase, ta, of the LF model
so that the modeled waveform can approximate the spectral
trend characteristics of the inverse filtered differentiated
glottal flow waveforms.

The frequency response of the LF model has a zero at
dc, a complex pole pair at $\alpha \pm j\omega_g$, and a real
pole at $-\epsilon$ [Fant and Lin, 1988]. The zero is due to
the fact that the integral of the LF model time function is
equal to zero. The complex pole pair is attributed to the
first segment of the LF model. The real pole is due to the
return phase, ta, which determines the second part of the LF
model. The zero and complex pole pair result in a spectral
roll-off of -6 dB/oct. The return phase provides an
additional spectral roll-off of -6 dB/oct above its cutoff
frequency. Since the return phase of the LF model is usually
needed  to  model  the  reduction  of  the  high  frequencies  in  the 
glottal  source  spectrum,  such  as  for  breathy  voices,  it  can 
be  used  to  control  the  spectral  tilt  of  the  model. 
The effect of the return phase on the source spectrum
is equivalent to that of a first order low-pass filter with
a cutoff frequency Fa = 1/(2π ta) [Fant and Lin, 1988]. Thus
the power spectral density function attributed to the return
phase can be expressed as:

$$ |S(\Omega)|^2 = \frac{(1/t_a)^2}{\Omega^2 + (1/t_a)^2} \tag{4-18} $$

where Ω is the analog frequency in radians per second. The
digital impulse invariant realization of equation (4-18) is
in the form [Childers and Durling, 1975]:

$$ S(z) = \frac{1/t_a}{1 - e^{-T/t_a} z^{-1}} \tag{4-19} $$

where T is the sampling period. Without regarding the
overall dc gain, equation (4-19) can be interpreted as a
real-pole model 1/(1 - zc z^{-1}), where the coefficient zc
is given by

$$ z_c = e^{-T/t_a} \tag{4-20} $$

Equation (4-20) means that the longer the return phase, the
steeper the spectral tilt, and the greater the reduction of
the high frequencies in the spectrum.
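
The mapping between the return phase and the real pole can be checked numerically. The sketch below assumes the relation zc = exp(-T/ta) implied by equation (4-19), the cutoff Fa = 1/(2π ta), and the 10 kHz sampling frequency used in this study; the function names are hypothetical, and the inverse mapping is simply the algebraic inversion of the same relation, not a documented step of the compensation algorithm.

```python
import math


def return_phase_to_pole(ta, T):
    """Real-pole coefficient z_c = exp(-T / t_a); ta and T in seconds."""
    return math.exp(-T / ta)


def pole_to_return_phase(zc, T):
    """Algebraic inverse of the relation above, valid for 0 < zc < 1."""
    return -T / math.log(zc)


def cutoff_frequency(ta):
    """Cutoff F_a = 1 / (2 * pi * t_a) of the equivalent low-pass filter."""
    return 1.0 / (2.0 * math.pi * ta)


T = 1.0 / 10000.0  # sampling period for Fs = 10 kHz
table = []
for ta_ms in (0.1, 0.5, 0.9):
    ta = ta_ms * 1e-3
    table.append((ta_ms, cutoff_frequency(ta), return_phase_to_pole(ta, T)))

for ta_ms, fa, zc in table:
    print(f"ta = {ta_ms:.1f} ms  Fa = {fa:7.1f} Hz  zc = {zc:.3f}")
```

As ta grows, zc moves toward the unit circle and the cutoff frequency drops, which is the behavior described above: a longer return phase gives a steeper tilt.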

The compensation algorithm first compares the
spectral tilts estimated (by using the three-pole source
model) for both the inverse filtered, differentiated glottal
waveforms and the modeled, differentiated glottal waveforms.
Then, using the relationship of equation (4-20), the return
phase,  ta,  of  the  LF  model  is  adjusted  to  approximate  the 
spectral  trend  of  the  inverse  filtered  differentiated 
glottal  flow  waveforms. 

Figure  4-6  shows  FFT  spectra  of  the  high-pass  filtered 
LF  model  for  different  values  of  ta  and  those  of  the 
corresponding real-pole model derived from equation (4-20).
We can see that the spectral tilt of the LF model behaves
like that of a first order real-pole model. To estimate the
spectral tilt of the inverse filtered differentiated glottal
flow and the LF model waveform, a second-order real-pole
model was used. Figure 4-7 shows FFT spectra of the inverse
filtered differentiated glottal flow, its LF model, and the
compensated LF model waveform. As can be seen from
Figure 4-7(c), the spectral tilt of the compensated LF
model can be enhanced by adjusting the return phase. One
side effect of adjusting the return phase is that the
settling time, tc, is also adjusted accordingly. Hence, the
chance of a spectral mismatch may increase in the
low-frequency region of the compensated LF model, although
it seems less important than the general spectral tilt
change in the high-frequency range.

Figure 4-6. FFT spectra of (a) LF model for different
values of ta (0.1 ~ 0.9 ms); and (b) the corresponding
real-pole model. Frequency axis in kHz. Note that LF model
spectra are differentiated to compensate for an extra pole
in the LF model.

Figure 4-7. FFT spectra of (a) inverse filtered
differentiated glottal flow, (b) its LF model, and (c)
compensated LF model.

4.3 Results for Statistical Study of Glottal Flow Factors

The  main  objective  of  this  portion  of  the  research  was 
to  determine  a statistically  representative  glottal  model 
parameter  set  and  its  range  for  each  voice  type,  including 
modal,  vocal  fry,  and  breathy  voices.  The  other  objective 
was  to  study  the  statistical  characteristics  of  the  LF  model 
parameter  sets  for  classifying  an  unknown  voice  type  into  a 
known  category  of  voice  type.  The  underlying  hypothesis  of 
our  modeling  of  vocal  quality  and  voice  characteristics  is 
that  various  voice  types  may  be  represented  by  distinctive 
vocal  fold  vibratory  characteristics  and  patterns  and,  thus, 
by  distinctive  characteristics  of  the  volume  velocity 
waveform  or  glottal  flow.  The  other  hypothesis  in  the 
statistical  analysis  is  that,  within  a speaker,  the  glottal 
flow  characteristics  vary  relatively  little,  unlike  the 
vocal  tract  characteristics,  across  sentences. 

Sets  of  glottal  model  parameters  for  different  vowels 
and  sentences  from  several  speakers  were  analyzed  by  glottal 
inverse  filtering.  For  sentences,  only  the  voiced  frames 
(identified  by  the  EGG  signal)  were  considered.  Thus  the 
data  included  multiple  voiced  sounds  from  multiple  speakers 
as  tabulated  in  Table  2-1.  For  each  sustained  vowel  the 
minimum  number  of  frames  analyzed  was  100.  The  number  of 
frames  for  a sentence  varied  depending  on  the  subject.  The 
LF  model  was  then  matched  to  the  inverse  filtered 
differentiated glottal flow waveforms. All analyses were
done pitch synchronously and thus each frame length
corresponds to a pitch-period. After the model matching,
each LF model timing parameter was divided by the
corresponding pitch-period for a frame in order to get a
normalized parameter. The normalized LF model parameter data
can be regarded as a two dimensional matrix, of which each
row represents an LF model parameter vector estimated for a
pitch period of a specific voiced speech signal. Figure 4-8
shows  the  histograms  for  the  normalized  timing  parameters  of 
the  LF  model  estimated  for  the  different  type  voices. 

Since different voice types reflect variations in
the "total auditory impression," we examined the "averaged"
characteristics of the source model parameters from each
subject and from all subjects in each voice type category.
Thus, in computing all statistics, two data sets were used.
We call the first data set a "pooled data set" and the
second data set a "subject's mean data set". The pooled
data  set  consists  of  all  parameter  vectors  from  all 
subjects,  while  the  subject's  mean  data  set  consists  of  the 
mean  parameter  vectors  for  each  subject.  Both  data  sets  were 
treated  equally,  i.e.,  no  special  treatment  (weighting)  was 
given  to  parameters  from  different  voiced  sounds  or  from 
different  subjects. 
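
The difference between the two data sets can be illustrated with a small sketch. The per-frame values below are hypothetical (they are not taken from the study), and the function names are illustrative only. The point is that the pooled statistic weights every frame equally, so subjects with more frames contribute more, whereas the subject's mean statistic weights every subject equally.

```python
from statistics import mean


def pooled_mean(frames_by_subject):
    """Mean over all frames from all subjects (pooled data set)."""
    all_frames = [v for frames in frames_by_subject.values() for v in frames]
    return mean(all_frames)


def subjects_mean(frames_by_subject):
    """Mean of the per-subject means (subject's mean data set)."""
    return mean(mean(frames) for frames in frames_by_subject.values())


# Hypothetical normalized tc values [% of pitch period] for two subjects
# with unequal numbers of analyzed frames.
tc_percent = {
    "subject_A": [60.0, 62.0, 64.0, 66.0],
    "subject_B": [80.0, 82.0],
}
pooled = pooled_mean(tc_percent)        # frame-weighted
by_subject = subjects_mean(tc_percent)  # subject-weighted
```

Here the pooled mean is 69.0 and the subject's mean is 72.0, because subject_A contributes twice as many frames as subject_B.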

Figure 4-8. Histograms of normalized LF timing parameters
for: (a) modal, (b) vocal fry, and (c) breathy type voices.

4.3.1 Results of Simple Statistics

The typical mean values and standard deviations (std)
of the LF model parameters estimated for different voice
types are tabulated in Table 4-3 and Table 4-4. Note that,
in these tables, (1) all mean values and std are expressed
in percentage [%] of pitch-period (pp), (2) tc was computed
from the LF model time-function, (3) SQlf was computed by
SQlf = tp/(tc - tp), and (4) f0 was computed by f0 = Fs/pp,
where Fs is the sampling frequency (10 kHz). Table 4-3 shows
mean values and standard deviations (std) for different
subjects within each voice type category. Table 4-4 was
formed by using the data from different subjects of each
voice type. In these tables tc is an approximation to the
OQlf defined in equation (4-12), because all timing
parameters were normalized with respect to the corresponding
pitch period. The SQlf, defined in equation (4-13), was also
computed for each analysis frame.
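
The per-frame quantities defined in notes (1), (3), and (4) can be sketched as follows. The frame values are hypothetical, the function names are illustrative, and the sketch assumes that the pitch period pp is expressed in samples when f0 = Fs/pp is evaluated.

```python
def normalize_timing(timing_samples, pp_samples):
    """Express each LF timing parameter as a percentage of the pitch period."""
    return {name: 100.0 * value / pp_samples
            for name, value in timing_samples.items()}


def speed_quotient(tp, tc):
    """SQlf = tp / (tc - tp); tp and tc must share the same units."""
    return tp / (tc - tp)


def fundamental_frequency(fs_hz, pp_samples):
    """f0 = Fs / pp, with pp given in samples (assumption)."""
    return fs_hz / pp_samples


# Hypothetical frame: timing parameters in samples at Fs = 10 kHz.
frame = {"tp": 40.0, "te": 55.0, "ta": 2.0, "tc": 60.0}
pp = 100.0  # pitch period in samples (10 ms)

normalized = normalize_timing(frame, pp)
sq = speed_quotient(normalized["tp"], normalized["tc"])
f0 = fundamental_frequency(10000.0, pp)
```

Note that SQlf is computed frame by frame, so the mean SQlf reported for a group generally differs from the value obtained by inserting the mean tp and mean tc into the same formula.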

The  statistical  results  from  both  the  subject's  mean 
data  set  and  the  pooled  data  set,  which  are  tabulated  in 
Table  4-4,  reveal  clear  differences  between  the  mean  values 
of  the  LF  model  parameters  for  different  type  voices.  Also 
both  data  sets  gave  us  approximately  the  same  numerical 
results,  as  can  be  seen  in  Table  4-4  and  in  Figure  4-9.  From 
these  tables  the  mean  values  of  the  normalized  LF  model 
parameters  of  different  voice  types  (except  the  speed 
quotient  and  the  pitch  period)  can  be  ordered  in  magnitude 
as  fry,  modal,  and  then  breathy  voice.  The  mean  value  of  the 
normalized  tc,  which  is  equivalent  to  the  open  quotient 
(OQlf)  in  this  study,  shows  the  most  notable  differences 
across different type voices. The mean value of the speed
quotient (SQlf) is comparable for modal and vocal fry
phonations,  but  the  mean  value  for  a breathy  voice  is 
smaller.  This  means  that  a breathy  voice  requires  less  vocal 
effort,  on  the  average,  than  modal  or  vocal  fry  voices  do. 
It  appears  from  these  data  sets  that  all  glottal  factors 
examined  from  different  type  voices  have  continuous 
numerical  values,  and  the  range  of  these  values  may  overlap 
for  different  type  voices. 

We  also  computed  the  statistics  for  the  pitch  periods 
(represented  as  pp)  estimated  from  the  EGG  signals.  Our  data 
show  wide  variations  of  the  pitch  periods  for  different  type 
voices:  breathy  and  vocal  fry  data  show  comparable 
variations  in  pitch  period,  while  modal  voice  data  show  a 
relatively  small  variation.  The  ranking  of  voice  type 
according  to  increasing  pitch  period  (or,  equivalently, 
decreasing  fundamental  frequency)  was  breathy,  modal,  and 
vocal  fry  types,  as  can  be  seen  from  Table  4-4. 

Also  computed  were  the  typical  mean  values  and 
standard  deviations  (std)  of  the  spectral-tilt  compensated, 
normalized  LF  model  parameters  for  each  subject  (Table  4-5) 
and for different type voices (Table 4-6). In Table 4-5 and
Table 4-6, only values for ta, tc, and SQlf are different
from those in Table 4-3 and Table 4-4. The reason for this
is that the spectral-tilt compensation (for the LF model
estimated in the time-domain) only adjusted the value of the
parameter ta, and then tc and SQlf were computed from the
compensated model waveforms. When compared to those of the
uncompensated, normalized LF model parameters, the mean
values  of  ta  and  tc  were  reduced,  as  expected,  and  thus 
those  of  the  speed  quotient  were  increased.  But  the 
orderings  of  voice  types  within  these  parameters  (ta,  tc, 
and  SQlf)  remained  the  same  as  those  of  the  uncompensated 
data,  as  can  be  observed  from  Table  4-6  and  Figure  4-10. 

4.3.2  Results  of  One-way  ANOVA  Analysis 

To  study  the  statistical  characteristics  of  the  LF 
model  parameters  of  different  voice  types,  an  ANOVA 
(ANalysis  Of  VAriance)  was  conducted.  In  performing  the 
ANOVA,  each  LF  model  parameter  was  treated  independently, 
because  a unique  LF  timing  parameter  set  generates  a unique 
glottal  model  time  waveform.  Thus  a one-way  ANOVA  test  was 
conducted. Though the SQlf is not an LF parameter, it was
included in the one-way ANOVA test, since it can provide
insight into the general pulse shape, i.e., the pulse
skewness. For both the uncompensated and the compensated
model parameter vectors, two data sets (four data sets in
total) were used in the ANOVA tests: (1) a pooled data set,
consisting of all parameter vectors from all subjects, and
(2) a subject's mean data set, consisting of the mean
parameter vectors for each subject.

Table 4-3. Mean values and standard deviations (std) of LF
model parameters for each subject

Phonation  Subject  tp       te       ta       tc       SQlf     pp       f0
type       (***)    [%]      [%]      [%]      [%]               [ms]     [Hz]

modal      DMHN     43.68    58.26    1.96     67.80    1.95     9.29     108.57
           (431)    (3.67)   (5.29)   (0.96)   (7.21)   (0.85)   (0.90)   (10.0)
           DRW      36.60    47.67    2.74     60.99    1.70     7.95     126.38
           (463)    (5.0)    (6.27)   (1.09)   (7.91)   (1.10)   (0.69)   (6.84)
           CKLN     44.32    60.96    2.16     71.53    1.86     8.33     120.5
           (400)    (3.67)   (3.31)   (0.98)   (5.93)   (1.26)   (0.55)   (7.69)
           average  41.54    55.63    2.28     66.77    1.84     8.52     118.49
                    (4.28)   (7.02)   (0.41)   (5.34)   (0.13)   (0.69)   (9.07)

vocal fry  CKLP     40.64    55.63    1.46     62.82    1.87     11.37    88.14
           (400)    (3.47)   (4.12)   (0.60)   (4.01)   (0.32)   (0.52)   (3.93)
           JTO      33.62    44.93    2.05     55.07    1.77     8.17     125.02
           (369)    (11.45)  (12.98)  (1.26)   (13.80)  (1.13)   (1.42)   (16.37)
           average  37.13    50.28    1.76     58.94    1.82     9.77     106.58
                    (4.97)   (7.57)   (0.29)   (5.48)   (0.07)   (2.26)   (26.07)

breathy    DMHP     58.31    81.31    2.70     93.17    1.75     10.65    96.27
           (205)    (5.63)   (7.62)   (0.69)   (4.65)   (0.46)   (1.62)   (17.03)
           EDR      39.59    54.01    4.48     75.01    1.16     7.77     130.64
           (393)    (6.33)   (8.44)   (1.35)   (11.31)  (0.34)   (1.02)   (15.08)
           GPM      46.71    72.43    2.86     84.56    1.32     9.99     103.31
           (250)    (11.56)  (16.28)  (1.15)   (12.85)  (0.85)   (1.32)   (35.4)
           JMS      52.27    78.14    4.73     94.38    1.45     5.08     200.36
           (475)    (7.17)   (8.94)   (2.20)   (7.63)   (1.10)   (0.73)   (34.4)
           average  49.22    71.47    3.69     86.78    1.42     8.37     132.65
                    (7.98)   (12.21)  (1.06)   (8.98)   (0.25)   (2.52)   (47.5)

* tp, te, ta, and tc are in percentage of pitch-period (pp)
* tc was computed from the LF model
* SQlf = tp/(tc - tp)
* f0 = Fs/pp, where Fs is the sampling frequency (10 kHz)
*** number of data samples used

Table 4-4. Mean values and standard deviations (std) of LF
model parameters for different voice types

Phonation  Data set type   tp       te       ta       tc       SQlf     pp       f0
type       (***)           [%]      [%]      [%]      [%]               [ms]     [Hz]

modal      pooled          41.34    55.30    2.30     66.52    1.84     8.51     118.63
           (1294)          (5.49)   (7.77)   (1.07)   (8.35)   (1.08)   (0.92)   (11.16)
           subject's mean  41.54    55.63    2.28     66.77    1.84     8.52     118.49
           (3)             (4.28)   (7.02)   (0.41)   (5.34)   (0.13)   (0.69)   (9.07)

vocal fry  pooled          37.27    50.49    1.75     59.10    1.82     9.83     105.84
           (769)           (9.02)   (10.87)  (1.02)   (10.70)  (0.81)   (1.91)   (21.82)
           subject's mean  37.13    50.28    1.76     58.94    1.82     9.77     106.58
           (2)             (4.97)   (7.57)   (0.29)   (5.48)   (0.07)   (2.26)   (26.07)

breathy    pooled          48.39    70.38    3.99     86.59    1.38     7.67     145.18
           (1323)          (10.23)  (15.14)  (1.83)   (12.74)  (0.82)   (2.46)   (51.3)
           subject's mean  49.22    71.47    3.69     86.78    1.42     8.37     132.65
           (4)             (7.98)   (12.21)  (1.06)   (8.98)   (0.25)   (2.52)   (47.51)

* tp, te, ta, and tc are in percentage of pitch-period (pp)
* tc was computed from the LF model
* SQlf = tp/(tc - tp)
* f0 = Fs/pp, where Fs is the sampling frequency (10 kHz)
*** number of data samples used

Figure 4-9. Mean values and standard deviations (T) of
normalized LF model parameters for different voice
types based on: (a) the pooled data set, (b) the
subject's mean data set

Table 4-5. Mean values and standard deviations (std) of
compensated LF model parameters for each subject

Phonation  Subject  tp       te       ta       tc       SQlf     pp       f0
type       (***)    [%]      [%]      [%]      [%]               [ms]     [Hz]

modal      DMHN     43.68    58.26    0.15     59.77    2.87     9.29     108.57
           (431)    (3.67)   (5.29)   (0.46)   (5.56)   (1.10)   (0.89)   (10.0)
           DRW      36.60    47.67    0.52     51.09    3.05     7.95     126.38
           (463)    (5.0)    (6.27)   (1.05)   (8.23)   (1.59)   (0.69)   (6.84)
           CKLN     44.32    60.96    0.58     64.65    2.43     8.33     120.5
           (400)    (3.67)   (3.31)   (1.07)   (6.17)   (1.13)   (0.55)   (7.69)
           average  41.54    55.63    0.41     58.50    2.78     8.52     118.49
                    (4.28)   (7.02)   (0.23)   (6.87)   (0.32)   (0.69)   (9.07)

vocal fry  CKLP     40.64    55.63    0.20     57.35    2.52     11.37    88.14
           (400)    (3.47)   (4.12)   (0.54)   (4.98)   (0.46)   (0.52)   (3.93)
           JTO      33.62    44.93    0.3      47.47    2.68     8.17     125.02
           (369)    (11.45)  (12.98)  (0.95)   (13.72)  (1.23)   (1.42)   (16.37)
           average  37.13    50.28    0.25     52.41    2.60     9.77     106.58
                    (4.97)   (7.57)   (0.07)   (6.99)   (0.12)   (2.26)   (26.07)

breathy    DMHP     58.31    81.31    2.62     88.27    2.15     10.65    96.27
           (205)    (5.63)   (7.62)   (0.24)   (9.16)   (0.84)   (1.62)   (17.03)
           EDR      39.59    54.01    3.46     70.68    1.37     7.79     130.64
           (393)    (6.33)   (8.44)   (1.70)   (10.20)  (0.46)   (1.02)   (15.08)
           GPM      46.71    72.43    1.58     78.35    1.59     9.99     103.31
           (250)    (11.56)  (16.28)  (1.75)   (19.18)  (0.67)   (1.32)   (35.4)
           JMS      52.27    78.14    1.76     84.71    1.80     5.08     200.36
           (475)    (7.17)   (8.94)   (2.14)   (9.61)   (0.95)   (0.73)   (34.4)
           average  49.22    71.47    2.35     80.46    1.73     8.37     132.65
                    (7.98)   (12.21)  (0.87)   (7.80)   (0.33)   (2.52)   (47.5)

* tp, te, ta, and tc are in percentage of pitch-period (pp)
* tc was computed from the LF model
* SQlf = tp/(tc - tp)
* f0 = Fs/pp, where Fs is the sampling frequency (10 kHz)
*** number of data samples used

Table 4-6. Mean values and standard deviations (std) of
compensated LF model parameters for different voice
types

Phonation  Data set type   tp       te       ta       tc       SQlf     pp       f0
type       (samples)       [%]      [%]      [%]      [%]               [ms]     [Hz]

modal      pooled          41.34    55.30    0.41     58.17    2.80     8.51     118.63
           (1294)          (5.49)   (7.77)   (0.92)   (8.84)   (1.33)   (0.92)   (11.16)
           subject's mean  41.54    55.63    0.41     58.50    2.78     8.52     118.49
           (3)             (4.28)   (7.02)   (0.23)   (6.87)   (0.32)   (0.69)   (9.07)

vocal fry  pooled          37.27    50.49    0.25     52.61    2.60     9.83     105.84
           (769)           (9.02)   (10.87)  (0.77)   (11.29)  (0.92)   (1.91)   (21.82)
           subject's mean  37.13    50.28    0.25     52.41    2.60     9.77     106.58
           (2)             (4.97)   (7.57)   (0.07)   (6.99)   (0.12)   (2.26)   (26.07)

breathy    pooled          48.39    70.38    2.36     79.85    1.69     7.67     145.18
           (1323)          (10.23)  (15.14)  (2.15)   (14.00)  (0.81)   (2.46)   (51.3)
           subject's mean  49.22    71.47    2.35     80.46    1.73     8.37     132.65
           (4)             (7.98)   (12.21)  (0.87)   (7.80)   (0.33)   (2.52)   (47.51)

* tp, te, ta, and tc are in percentage of pitch-period (pp)
* tc was computed from the LF model
* SQlf = tp/(tc - tp)
* f0 = Fs/pp, where Fs is the sampling frequency (10 kHz)
*** number of data samples used

Figure 4-10. Mean values and standard deviations (T) of
compensated, normalized LF model parameters for
different voice types based on: (a) the pooled data
set, (b) the subject's mean data set

Before describing the ANOVA test results, we mention
the  conditions  of  the  ANOVA  test.  The  three  basic 
assumptions  for  the  analysis  of  variance  are: 

(1)  In  each  group  the  dependent  variables  are  normally 
distributed. 

(2)  The  population  variances  for  the  groups  are  equal.  This 
is  called  the  homogeneity  of  variance  assumption. 

(3)  The  observations  are  independent. 

For  real  world  data,  these  conditions  are  not  easily  met. 
Since  all  our  data  were  measured  independently,  we  consider 
only  the  first  two  conditions.  To  accommodate  relatively 
small  violations  of  these  conditions,  several  ANOVA 
procedures  have  been  proposed.  Among  them,  the  Tukey 
procedure  is  favored  for  paired  comparisons.  The  ANOVA 
analysis  with  the  Tukey  procedure  is  robust  with  respect  to 
a violation  of  the  normality  assumption  and  unequal 
variances [Stevens, 1990].

The  results  of  the  ANOVA  tests  for  the  normalized  LF 
model  parameters,  estimated  in  the  time-domain,  and  the 
spectral-tilt  compensated,  normalized  LF  model  parameters 
are  tabulated  in  Table  4-7  and  Table  4-8,  respectively.  For 
both  the  uncompensated  and  the  compensated  parameters,  the 
pooled  data  sets  show  negligible  significance  levels  for  all 
parameters.  This  means  that,  in  the  pooled  data  set,  each 
estimated  LF  model  parameter  has  statistically  different 
characteristics  across  different  voice  types.  The  results  of 
the ANOVA tests on the subject's mean data set, however,
show relatively high significance levels. This implies that
the  LF  model  parameters,  averaged  for  each  subject,  do  not 
have  statistically  different  characteristics  across 
different  voice  types.  The  reason  for  this  is  that  variance 
information  of  each  parameter  for  each  subject  cannot  be 
considered  by  the  ANOVA  procedure  when  using  the  subject's 
mean  data  set. 
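
As an illustration of the one-way ANOVA on a subject's mean data set, the sketch below computes the F statistic for the uncompensated tc parameter from the per-subject means listed in Table 4-3. It is a minimal textbook implementation under the usual ANOVA assumptions, not the software used in the study, and it does not include the Tukey procedure. Converting F to a significance level requires the F-distribution with the returned degrees of freedom, which is left to a statistics library.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for g in groups:
        g_mean = sum(g) / len(g)
        # Between-group sum of squares, weighted by group size.
        ss_between += len(g) * (g_mean - grand_mean) ** 2
        # Within-group sum of squares around each group mean.
        ss_within += sum((v - g_mean) ** 2 for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Per-subject mean tc [% of pitch period], uncompensated (Table 4-3).
tc_modal = [67.80, 60.99, 71.53]
tc_fry = [62.82, 55.07]
tc_breathy = [93.17, 75.01, 84.56, 94.38]

f_stat, df1, df2 = one_way_anova_f([tc_modal, tc_fry, tc_breathy])
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

With only nine subjects the test has 2 and 6 degrees of freedom, which is why the subject's mean data set yields much larger significance levels than the pooled data set even when the group means are well separated.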

When comparing the significance levels between the
uncompensated and the compensated model parameters for the
subject's mean data set, the three parameters ta, tc, and
SQlf showed reduced significance levels when compensated.
This was particularly so for the parameter ta, which
controls the spectral tilt of the model spectrum: its
significance level was 5.81% before compensation and 0.9%
after. The significance level of SQlf was also reduced
considerably (from 5.12% to 0.8%), while that of tc changed
only slightly (from 0.88% to 0.71%). Based on these results,
it can be said that the compensated model parameters ta, tc,
and SQlf are more significant statistically across voice
types than the uncompensated parameters.

In  summary,  our  data  showed  that  the  parameters  ta, 
tc,  and  SQlf  are  statistically  more  important  for 
characterizing  (modeling)  the  different  voice  types  than  are 
the  parameters  tp  and  te,  in  both  the  uncompensated  and 
spectral-tilt  compensated  cases.  Among  these  parameters  the 
parameter  tc,  computed  from  the  LF  model,  (or  equivalently 
the OQlf) is statistically most significant across voice
types. In addition, the pitch period, pp, has the largest
significance  level,  meaning  that  it  is  relatively  less 
significant  statistically  than  the  other  parameters. 


Table 4-7. ANOVA results for normalized LF model
parameters

                   Significance level [%]
Data set type      tp       te       ta       tc       SQlf     pp

pooled data        0.01     0.01     0.01     0.0      0.01     0.01
subject's mean     15.08    8.80     5.81     0.88     5.12     72.76


Table 4-8. ANOVA results for spectral-tilt compensated,
normalized LF model parameters

                   Significance level [%]
Data set type      tp       te       ta       tc       SQlf     pp

pooled data        0.01     0.01     0.01     0.0      0.01     0.01
subject's mean     15.08    8.80     0.9      0.71     0.8      72.76

4.3.3 Results of Discriminant Analysis

In  order  to  study  the  capability  of  the  LF  model 
parameter  set  to  classify  an  unknown  voice  type  into  a 
specific  category,  statistical  discriminant  analysis  was 
performed.  We  excluded  the  parameter  pp,  the  pitch  period, 
from  the  parameter  vector,  since  the  results  of  the  ANOVA 
tests  showed  this  parameter  to  be  of  little  statistical 
value. Thus, the model parameter vector consisted of tp, te,
ta, tc, and SQlf. For both the uncompensated and the
spectral-tilt compensated parameter vectors, two data sets
(four data sets in total) were treated independently: (1) a
pooled data set, consisting of all parameter vectors from
all subjects, and (2) a subject's mean data set, consisting
of the mean parameter vectors for each subject.

In  conducting  the  discriminant  analysis  we  assumed 
that  (1)  each  group,  i.e.,  each  voice  type,  had  the  same 
variance  in  the  model  parameter  vector  space,  and  that  (2) 
the  a priori  probability  of  the  occurrence  of  each  voice 
type  was  the  same.  While  the  second  assumption  is 
reasonable,  the  first  assumption  was  verified  using  the 
Student  t-test,  performed  independently,  on  each  data  set. 
Under these conditions, the generalized squared distance,
D_j^2(x), between a test vector, x, and the j-th reference
vector, X_j, can be represented as:

    D_j^2(x) = d_j^2(x) = (x - X_j)' V^{-1} (x - X_j)        (4-21)

In equation (4-21), x is a test vector to be classified, X_j
is the j-th reference vector, which is the parameter vector
averaged over the elements of the j-th group, V is the pooled
covariance matrix, and ['] denotes the matrix transpose
operation. The decision strategy is that a test vector is
classified into the group with the smallest distance between
the reference and test vectors. In computing the distance,
no weighting was assigned to the elements of the parameter
vectors, i.e., all parameters were treated equally.
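
The decision rule of equation (4-21) can be sketched in a few lines of code. The following is a minimal illustration, not the program used in this study: the function names, the pure-Python linear solver, and the two-parameter training vectors are all hypothetical, and the pooled covariance matrix is estimated with the within-group degrees of freedom (total number of vectors minus number of groups), which is an assumption.

```python
# Minimum-distance classifier with a pooled covariance matrix, after eq. (4-21).
# All data below are made up for illustration.

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def pooled_covariance(groups):
    """Pooled within-group covariance matrix V (equal group covariances assumed)."""
    dim = len(next(iter(groups.values()))[0])
    total = sum(len(rows) for rows in groups.values())
    scatter = [[0.0] * dim for _ in range(dim)]
    for rows in groups.values():
        m = mean_vector(rows)
        for r in rows:
            d = [a - b for a, b in zip(r, m)]
            for i in range(dim):
                for j in range(dim):
                    scatter[i][j] += d[i] * d[j]
    dof = total - len(groups)  # assumed pooled degrees of freedom
    return [[s / dof for s in row] for row in scatter]

def solve(a, b):
    """Solve a*y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        y[i] = (m[i][n] - sum(m[i][k] * y[k] for k in range(i + 1, n))) / m[i][i]
    return y

def squared_distance(x, ref, v):
    """D_j^2(x) = (x - X_j)' V^{-1} (x - X_j)."""
    d = [a - b for a, b in zip(x, ref)]
    return sum(di * yi for di, yi in zip(d, solve(v, d)))

def classify(x, groups):
    """Return the label of the group whose reference vector is closest to x."""
    v = pooled_covariance(groups)
    refs = {label: mean_vector(rows) for label, rows in groups.items()}
    return min(refs, key=lambda label: squared_distance(x, refs[label], v))

# Hypothetical two-parameter training vectors (not measured data).
training = {
    "modal":     [[58.0, 0.4], [60.0, 0.5], [57.0, 0.3], [59.0, 0.4]],
    "vocal fry": [[52.0, 0.2], [50.0, 0.3], [54.0, 0.2], [53.0, 0.3]],
    "breathy":   [[80.0, 2.3], [82.0, 2.5], [79.0, 2.2], [81.0, 2.4]],
}
label = classify([80.5, 2.3], training)
```

With these made-up vectors, `label` is "breathy". Solving the linear system for each distance avoids forming the inverse of V explicitly, which is a convenience of this sketch rather than a feature of the original analysis.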

The  subject's  mean  data  sets  were  all  classified 
correctly,  i.e.,  100%  classification  rates,  as  can  be  seen 
in  Table  4-9  and  Table  4-10,  when  using  both  the 
uncompensated  and  the  compensated  model  parameter  vectors. 

The classification rates of the pooled data sets ranged
from 58.91% to 80.20%, as can be seen from Table 4-11 and
Table 4-12. When the compensated parameter vectors were
used, the classification rates of both the modal and breathy
types increased compared with the rates obtained with the
uncompensated vectors (from 60.59% to 61.67% and from 73.62%
to 80.20%, respectively). In contrast, the classification rate
of the vocal fry type decreased from 71.13% to 58.91% when
the compensated parameter vectors were used. This decrease
was largely attributed to vocal fry data being classified as
the modal type. Thus, it seems that the glottal waveform
characteristics of our vocal fry data were relatively similar,
on the average, to those of the modal voice, although the
vocal fry voice showed more variation in the glottal waveform
characteristics than did the modal voice. The pooled data
sets from both the uncompensated and the spectral-tilt
compensated model parameter vectors showed that most of the
misclassifications occurred between the modal type and the
other voice types. Misclassifications between the breathy and
the vocal fry types, however, were quite small.

In summary, to classify a voice into a certain type
based on the measured glottal factors, one should examine
the "averaged behavior" of the glottal factors. The
classification results for the subject's mean data sets also
provide experimental evidence for the perceptual definition of
voice type: the variations in the "total auditory
impression" the listener experiences upon hearing the speech
of another talker [Childers et al. 1987a]. It can also be
said that a modal voice is usually more difficult to
identify correctly than the other voice types, while a
breathy voice and a vocal fry voice can be differentiated easily.

Table 4-9. Classification analysis results of the
subject's mean data set for the uncompensated parameter
vectors

                 Classification results in % into type
from type        modal       vocal fry    breathy
modal            100         0            0
vocal fry        0           100          0
breathy          0           0            100

Table 4-10. Classification analysis results of the
subject's mean data set for the compensated parameter
vectors

                 Classification results in % into type
from type        modal       vocal fry    breathy
modal            100         0            0
vocal fry        0           100          0
breathy          0           0            100

Table 4-11. Classification analysis results of the pooled
data set for the uncompensated parameter vectors

                 Classification results in % into type
from type        modal       vocal fry    breathy
modal            60.59       27.51        11.90
vocal fry        26.92       71.13        1.95
breathy          23.66       2.72         73.62

Table 4-12. Classification analysis results of the pooled
data set for the compensated parameter vectors

                 Classification results in % into type
from type        modal       vocal fry    breathy
modal            61.67       29.06        9.27
vocal fry        36.41       58.91        4.68
breathy          12.47       7.33         80.20

4.4  Application  and  Evaluation 


To  evaluate  the  typical  values  of  the  glottal  source 
model  parameters,  we  synthesized  several  speech  tokens  by 
changing  the  characteristics  of  the  glottal  model  of  a 
formant  synthesizer  based  on  the  model  parameters  we 
estimated.  Then  we  verified  the  synthesized  speech  tokens 
through  informal  listening  tests.  In  this  section  we  shall 
describe  the  speech  synthesis  procedure  used  and  then 
present  some  examples  of  the  synthesized  speech  data  as  well 
as  the  results  of  the  informal  listening  tests. 

4.4.1  Speaker  Modeling  for  Different  Voice  Types 

In order to synthesize voices of various types using fixed
vocal tract characteristics, we need a speech synthesizer as
well as specifications of the glottal sources for the various
voice types. As pointed out by Lee and Childers [1989], the
four glottal factors important for characterizing several
voice types are: (1) glottal pulse width, (2) glottal pulse
skewness, (3) abruptness of glottal closure, and (4) turbulent
noise. Since the LF model can represent the first three glottal
factors effectively, the LF model parameters estimated for
the different voice types were employed to represent their
glottal characteristics. Using a formant synthesizer and the
LF model, several speech tokens were synthesized. The formant
speech synthesizer was able to control the characteristics of
the glottal source and those of the vocal tract independently
[Pinto et al., 1989]. The glottal models for the different
voice types were verified through informal listening tests.

Before synthesis, the formant synthesizer required
that all necessary parameters be provided. To provide an
experimental basis, a sustained vowel, /i/, and a voiced
sentence, "We were away a year ago," spoken by a normal male
subject, DMH, were analyzed using pitch-synchronous
LP analysis with the aid of the EGG signal. The formant
structure contours, which consist of formant frequencies and
bandwidths but not formant amplitudes, were extracted
from the linear predictive coefficients for all analysis
frames. The gain contours for the voiced and fricative
excitation sources were also obtained from the speech signal
with the aid of the EGG signal. All formant and gain
contours were smoothed by a nonlinear smoothing algorithm
before being used by the formant synthesizer. Figure 4-11
shows the sentence analyzed, together with the
spectrogram, the smoothed formant frequency tracks, the
fundamental frequency contour, and the voiced (av) and
unvoiced (af) gain contours of the excitation sources, which
were used to synthesize the test sentences. Note that the
smoothed formant frequency tracks were interpolated in
unvoiced or silent regions, as shown in Figure 4-11(c).

Figure 4-11. Sentence "We were away a year ago.": (a)
Speech signal, (b) Spectrogram of speech signal, (c)
Formant frequency tracks, (d) Fundamental frequency
contour, and (e) voiced and (f) unvoiced
excitation-source gain contours estimated.

The formant synthesizer used the glottal model
parameters and the excitation-source gain contours to
generate a glottal flow signal, which was then used to
synthesize the speech signal. We used the statistical
results for the LF model parameters, as well as those for the
frequency intensity ratio and the standard deviation (std)
of the fundamental frequency, as the representative source
models to generate the different voice types. The turbulent
noise source was a white Gaussian random number generator
that was amplitude modulated pitch synchronously [Lee and
Childers, 1989]. The intensity of the noise source was
controlled by the aspiration gain contour (af), which was
approximated by the voiced gain contour (av) adjusted by the
frequency intensity ratio. Since the analysis was done pitch
synchronously, each synthesis frame corresponded to a pitch
period. In synthesis, however, all frame sizes were kept
constant, while the fundamental frequency was manipulated
independently. The amount of pitch perturbation was
specified as a percentage variation in the fundamental
frequency and implemented with a random number generator. In
synthesizing sustained vowels and sentences, only the
parameters related to the glottal source model (the LF model
parameters, the turbulent noise component, and the pitch
perturbation) were adjusted for the different voice types;
the formant structure and the pitch contour were kept the
same.
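
As a rough illustration of the pitch perturbation step, the sketch below varies a fundamental frequency contour by a bounded random percentage per pitch period. The uniform distribution, the function name `perturb_f0`, the fixed seed, and the example percentages are assumptions made for illustration; the text states only that the perturbation was a percentage variation in the fundamental frequency produced by a random number generator.

```python
import random

def perturb_f0(f0_contour, percent, seed=0):
    """Return a copy of f0_contour with each value varied by up to +/- percent.

    Assumption: the variation is drawn uniformly and independently for each
    pitch period; the actual distribution is not given in the text.
    """
    rng = random.Random(seed)
    scale = percent / 100.0
    return [f0 * (1.0 + rng.uniform(-scale, scale)) for f0 in f0_contour]

# A flat 100 Hz contour of 50 pitch periods (hypothetical values).
flat = [100.0] * 50
small_jitter = perturb_f0(flat, percent=1.0)    # e.g. a modal-like setting
large_jitter = perturb_f0(flat, percent=10.0)   # e.g. a vocal-fry-like setting
```

Keeping the perturbation as a multiplicative factor on the contour leaves the underlying pitch contour unchanged, so the same contour can be reused for every voice type and only the percentage is varied.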

The modeled glottal flow waveforms for synthesizing
the different voice types are shown in Figure 4-12. The
synthesized vowel signals are shown in Figure 4-13. For
comparison purposes, the natural vowel signals are also
shown in Figure 4-14. The synthesized sentences and
spectrograms, using the modeled glottal flow waveforms, are
shown in Figure 4-15. Examples of natural sentences and
spectrograms of different voice types are shown in
Figure 4-16, Figure 4-17, and Figure 4-18. It can be
observed from these figures that the glottal characteristics
of the breathy voices can be easily distinguished from those
of the modal and vocal fry voices. Although the glottal
characteristics of the vocal fry voices are similar, on the
average, to those of the modal voices, they show relatively
larger variations.

Figure 4-12. Modeled glottal flow waveforms for
synthesizing different type voices: (a) modal, (b)
vocal fry, and (c) breathy voices.

Figure 4-13. Synthesized vowels, /i/, of different types
by using the corresponding modeled glottal flow
waveforms in Figure 4-12: (a) modal, (b) vocal fry, and
(c) breathy voices.

4.4.2  Perceptual  Evaluation 

In order to evaluate the values of the glottal source
factors obtained by the statistical analysis, we conducted
an informal listening test on the synthesized speech tokens.
For each glottal factor under investigation, a group of
synthetic vowels and voiced sentences was generated by
changing the appropriate glottal parameter while keeping
the formant structure and the fundamental frequency fixed.
Synthesized speech tokens of different types were compared
using natural speech signals of the corresponding types as
references.

We first compared speech tokens generated by
incorporating a turbulent noise source with those generated
without the noise source. This was done separately for each
voice type. The intensity of the noise source was controlled
by the high-to-low frequency intensity ratio for the different
voice types and was kept the same for further tests. From
this test we found that the incorporation of the turbulent
noise source generally enhanced the naturalness of the
synthesized speech. The turbulent noise source had a more
salient effect for a breathy voice than for modal or vocal fry
voices. Thus, for the further listening tests, we used speech
tokens synthesized with the noise source.

Figure 4-14. Natural vowel, /i/, signals of different
types: (a) modal, (b) vocal fry, and (c) breathy
voices. Axes: amplitude versus time [ms].

Figure 4-15. Synthesized sentences and
spectrograms of different types: (a) modal, (b) vocal
fry, and (c) breathy voices.

Figure 4-16. Natural modal voice: (a) Sentence, "We were
away a year ago," (b) Spectrogram.

Figure 4-17. Natural fry voice: (a) Sentence, "We were
away a year ago," (b) Spectrogram.

Figure 4-18. Natural breathy voice: (a) Sentence, "We
were away a year ago," (b) Spectrogram.

We then compared the uncompensated and the
spectral-tilt compensated source model parameters for each
voice type. For modal and vocal fry voices, the synthesized
speech signals generated with the compensated model parameters
were rated as sounding better than those generated with the
uncompensated ones. For a breathy voice, however, the
uncompensated model parameters gave the better results.
Thus, for the breathy voice, the spectral-tilt compensation
of the LF model parameters may not be necessary. In other
words, the steep spectral tilt, or the progressive glottal
closure, is a good perceptual correlate of the breathy
voice. Based on these results we did not consider the
uncompensated model parameters further in the listening
test.

According to the ANOVA results on our data, the three
parameters tc (equivalently, OQLF), ta, and SQLF of the LF
source model were important in characterizing different
voice types. Although these parameters are related to each
other (note that tc and SQLF were computed from the model),
they were controlled independently. In other words, when
adjusting one parameter, all the other parameters were kept
the same. As ta increased, the synthesized voices sounded more
lax (or hypo-functional). The parameter SQLF, however,
gave the opposite result for all voice types, i.e., an
increase of SQLF (or, equivalently, increasing tp with
constant tc) resulted in tense (or hyper-functional)
voices. As tc increased, the synthesized voices sounded
softer. Among these three parameters, tc (OQLF) is the most
prominent one in characterizing the different voice types.

The pitch period (pp) was not investigated separately.
However, the pitch perturbation (variation) was important
for differentiating the vocal fry type from the modal type.
Note that the statistical results of our vocal fry data were
more similar to the modal data than were those of the breathy
data.

It is not easy to use a listening test to determine
the ranges of values of the glottal factors for a certain voice
type. The difficulty is that there are multiple parameters
(the type variation is affected by all parameters), so the
number of synthesized samples required would be too large to be
practical. To determine the ranges of values of the glottal
factors, we therefore used the subject's mean data set of the
compensated source model parameter values, since this data
set yielded the best results for both the ANOVA and the
discriminant analyses. Table 4-13 shows the ranges of the
values of the source model parameters computed at the 99.0%
confidence level. The formula used is:

    mean ± 2.576 · ste                                           (4-22)

where ste is the standard error given by

    ste = std / df = (standard deviation) / (degrees of freedom)  (4-23)

When using Table 4-13, one should note that the values
of tp and te must be less than that of tc. More
specifically, to be valid as a set of LF model parameters,
these parameters should satisfy the relationships

    0 \le t_p < t_e \le t_c                                      (4-24)

and

    t_a \ge 0                                                    (4-25)

Note also that tp and tc are related by SQLF:

    SQ_{LF} = t_p / (t_c - t_p)                                  (4-26)


Thus the rule for selecting a set of LF model parameters from
Table 4-13, in order to synthesize different voice types,
could be either

• Specify f0 and tc, then select tp, te,
and ta,

or

• Specify f0, tc, and SQLF, then compute tp
by using equation (4-26) and select te and
ta.

Also note that when specifying the model parameters (except
SQLF and pp) for all three voice types at the same time, the
ranking of voice types according to increasing values of
each parameter should be vocal fry, modal, and breathy
types, as shown in Table 4-13. The perturbation of the pitch
period should also be considered. The ranking of voice types
according to increasing pitch perturbation should be modal,
breathy, and vocal fry types.
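
The second selection rule can be written out as a short sketch. It assumes the relation SQLF = tp/(tc - tp) given in the notes to Table 4-13, so that tp = SQLF · tc / (1 + SQLF), and it reads the validity conditions as 0 <= tp < te <= tc and ta >= 0. The function names and the returned dictionary layout are hypothetical, and the example simply plugs in the modal centre values of Table 4-13.

```python
def tp_from_sq(tc, sq_lf):
    """Solve SQLF = tp / (tc - tp) for tp (tp and tc in percent of the pitch period)."""
    if sq_lf < 0:
        raise ValueError("SQLF must be non-negative")
    return sq_lf * tc / (1.0 + sq_lf)

def is_valid_lf_set(tp, te, ta, tc):
    """Check the assumed timing constraints 0 <= tp < te <= tc and ta >= 0."""
    return 0.0 <= tp < te <= tc and ta >= 0.0

def select_lf_parameters(f0_hz, tc, sq_lf, te, ta):
    """Specify f0, tc and SQLF, compute tp, then accept te and ta if valid."""
    tp = tp_from_sq(tc, sq_lf)
    if not is_valid_lf_set(tp, te, ta, tc):
        raise ValueError("parameters violate the LF timing constraints")
    return {"f0_hz": f0_hz, "pp_ms": 1000.0 / f0_hz,
            "tp": tp, "te": te, "ta": ta, "tc": tc}

# Modal-voice centre values from Table 4-13 (percent of pitch period, Hz).
modal = select_lf_parameters(f0_hz=118.49, tc=58.50, sq_lf=2.78, te=55.63, ta=0.41)
```

Because tp is derived from tc and SQLF here, it need not coincide with the tabulated mean of tp; the check only guarantees that the resulting set is internally consistent.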

In summary, we found that the glottal pulse width, the
abruptness of glottal closure, and the spectral tilt were
distinctive factors for differentiating a breathy voice from
modal and vocal fry voices. Secondly, the pitch perturbation
(jitter factor) was necessary to generate vocal fry. It was
experimentally verified that the incorporation of a
turbulent noise source generally enhanced the naturalness of
the synthesized speech. Turbulent noise was more salient
for a breathy voice than for modal or vocal fry voices.
Finally, with carefully chosen source model parameters,
speech that mimics various voices can be systematically
synthesized.

Table 4-13. Ranges of values of the spectral-tilt
compensated, normalized LF model parameters from the
subject's mean data set for different voice types
(confidence level: 99.0%)

Phonation type (***)   tp [%]   te [%]   ta [%]   tc [%]   SQLF     pp [ms]   f0 [Hz]
modal [count           41.54    55.63    0.41     58.50    2.78     8.52      118.49
  illegible]           ±5.51    ±7.51    ±0.30    ±8.85    ±0.55    ±0.89     ±11.68
vocal fry (2)          37.13    50.28    0.25     52.41    2.60     9.77      106.58
                       ±12.80   ±19.50   ±0.18    ±18.0    ±0.31    ±5.82     ±67.16
breathy (4)            49.22    71.47    2.35     80.46    1.73     8.37      132.65
                       ±6.85    ±10.48   ±0.75    ±6.70    ±0.28    ±2.16     ±40.8

* tp, te, ta, and tc are in percentage of the pitch period (pp)
* tc was computed from the LF model
* SQLF = tp/(tc-tp)
* f0 = Fs/pp, where Fs is the sampling frequency (10 kHz)
*** the number of data used


CHAPTER  5 
CONCLUSIONS 

5.1 Summary

The acoustic variability of the glottal source factors
for different voices was investigated based on the linear
speech production model. The major techniques used in this
research were glottal inverse filtering and source model
matching. The procedure of glottal inverse filtering was
automated using both the speech and EGG signals. Our inverse
filtering method could process sentences as well as
sustained vowels. For sentences, the inverse filtering was
performed on voiced segments only. A set of statistical
descriptors of glottal factors for various voice types was
obtained, together with rules for specifying source model
parameters useful for synthesizing different voices.

The inverse filtered glottal flow waveforms for
different voices showed typical patterns that can be
characterized by pitch period, pitch perturbation, pulse
width, pulse skewing, abruptness of closure, and turbulent
noise. The spectral characteristics of the glottal flow
waveforms for the different voice types also differed in
spectral tilt, high-to-low frequency intensity ratio, etc.
Therefore, different glottal source pulses could be used in
a speech production model to synthesize various voices.
Also, the glottal factors could be identified and quantified
to reduce the acoustic variability of features used in
speech recognition systems.

To  obtain  the  representative  values  and  ranges  of  the 
glottal  factors  for  different  voice  types,  the  LF  source 
model  was  employed.  For  the  inverse  filtered  glottal  flow 
waveforms,  the  LF  model  parameters  were  estimated,  in  a 
least  squared  error  sense,  so  that  the  model  matches  the 
time  and  frequency  domain  glottal  characteristics.  Typical 
values  and  ranges  of  the  LF  model  parameters  for  different 
voice  types,  namely  modal,  vocal  fry,  and  breathy  phonation 
types,  were  then  determined  through  statistical  analyses  of 
the  matched  model  parameters. 

The ranking of voice types according to increasing
pitch period was breathy, modal, and vocal fry.
Variations of the pitch period were large across subjects.
The ranking of voice types according to increasing glottal
pulse width, abruptness of closure, and spectral tilt
was vocal fry, modal, and breathy. Glottal pulse
skewing was comparable for modal and vocal fry voices, and
both were larger than for breathy voices. Vocal effort
increases from breathy to vocal fry voices, while that of
modal voices is in between the two. The ranking of voice
types according to increasing mean a-value was vocal fry,
modal, and breathy. The ranking of voice types according to
increasing spectral tilt was vocal fry, modal, and breathy.
This ordering is the same as that for the high-to-low
frequency intensity ratios (or the a-values). All glottal
factors measured have continuous numerical values. The
ranges of these values may overlap for different voice
types.

To verify the statistical analysis results and to
obtain rules for synthesizing various voice types, several
speech tokens were synthesized using a formant
synthesizer. The synthesized speech tokens were evaluated
perceptually through informal listening tests. The synthesized
speech tokens, based on the statistical results obtained,
mimicked the various voice types reasonably well when compared
with natural voices of the corresponding types. Other findings included:

(1)  The  glottal  pulse  width,  the  abruptness  of  glottal 
closure,  and  the  spectral  tilt  were  distinctive  factors  for 
differentiating  a breathy  voice  from  modal  and  vocal  fry. 

(2)  In  general,  the  incorporation  of  a turbulent  noise 
source  enhanced  the  naturalness  of  the  synthesized  speech 
and  was  more  salient  for  a breathy  voice  than  for  modal 
and/or  vocal  fry  voices. 

(3)  Pitch  perturbation  (jitter  factor)  was  necessary 
to  generate  the  vocal  fry. 

Rules for specifying a set of LF model parameters, in
order to synthesize different voice types, were developed as
follows:

(1) Specify f0 and tc, then select tp, te, and ta; or
specify f0, tc, and SQLF, then compute tp and select te and
ta.

(2) When specifying the model parameters (except SQLF
and pp) for all three voice types at the same time, the
ranking of voice types according to increasing values of
each parameter should be vocal fry, modal, and breathy
types.

(3) The ranking of voice types according to increasing
pitch perturbation should be modal, breathy, and vocal fry
types.

The results obtained and the method developed in this
study can serve as the basis for further study of (1) the
estimation of excitation source models for a broad range of
voice types including falsetto, hoarse, and harsh voices,
(2) the quantification of the severity of voice quality or vocal
dysfunction, (3) speaker normalization to improve the
performance of a speaker-independent speech recognition
system, or the development of an objective distortion measure
that incorporates dynamic features of speech signals
(Appendix B), (4) a database of different voice types to be
used in training a speech processing system, (5) the effects of
variability caused by the vocal tract, and (6) the incorporation
of high-level linguistic and supra-segmental knowledge through
the rules and factors obtained in order to analyze speech
signals. Together, these can lead to the development of a
unified method for investigating speaker variability.

In summary, through this study, a statistical model of
the glottal source characteristics was developed and some
rules for synthesizing various voice types were obtained.
These results can be used in synthesizing different voices
and can help in designing a more sophisticated speech synthesizer.
A robust analysis method was derived that is useful for
both speech synthesis and speech recognition, since it
identifies major acoustic factors of speech signals,
especially those of the glottal source.

5.2  Future  Work 

There  is  no  established  consensus  among  researchers 
concerning  the  sources  and  functions  of  speech  variability 
as  well  as  the  nature  of  speech  invariance,  which  is  crucial 
to  understanding  human  speech  communication.  It  is,  however, 
clear  that  variability  in  the  acoustic  manifestations  of  an 
utterance  is  substantial,  and  is  a result  of  causes  from 
many  sources.  In  order  to  understand  human  speech 
communication  thoroughly,  and  thus  to  advance  the  speech 
processing  systems  in  general,  we  need  to  understand  the 
processes  by  which  variability  arises  and  by  which  listeners 
have  learned  to  adapt  to  acoustic  variation.  In  addition,  we 
need  to  consider  the  acoustic  detail  that  provides 
information  on  the  variability.  The  ultimate  objective  is  to 
find  a set  of  rules  for  the  variations,  which  could  be 
incorporated  into  speech  processing  systems.  Based  on  this 
study, the following recommendations could help us progress
toward this objective:

(1) The inverse filtering algorithm for sentences
should be improved to accommodate relatively large
variations in pitch period and glottal timing (opening and
closing instants) for deviant voices such as hoarse and harsh
voices. One may want to incorporate a VUMS
(Voiced/Unvoiced/Mixed/Silence) or VUMNS
(Voiced/Unvoiced/Mixed/Nasal/Silence) classifier so that
glottal inverse filtering could be performed with specified
strategies depending on the class of speech signals.

(2) The effects on the source factors of the interaction
with supraglottal resonances and of the prosody of connected
speech should be systematically investigated.

(3) A procedure for determining the vocal tract
response together with the parameters of the glottal pulse
model should be studied to incorporate vocal tract
variations.

(4) Quantitative metrics should be studied in
order to determine the severity rating of voice types, which
would greatly aid speech researchers. One example of a
distortion measure is presented in Appendix B.


APPENDIX A
VERIFICATION OF GLOTTAL INVERSE FILTERING PROGRAM

To verify that the glottal inverse filtering program
runs correctly, a synthesized sentence with a known source
input was used to test the program. The input to the glottal
inverse filtering program was synthesized speech data
generated by concatenating two synthesized
sustained vowels, /a/, with a 20 ms transient between them.
To synthesize the sustained vowel, the input was a periodic
sequence of LF-modeled glottal pulses with a zero return
phase. The pitch frequency of the glottal pulses was 100 Hz.
The EGG waveform was replaced by a square wave that is
positive during the open phase and zero during the closed
phase. When differentiated, this waveform gave the
minima and maxima required by the analysis algorithm.
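
The stand-in for the EGG signal can be illustrated as follows. This is only a sketch: the 100 Hz pitch comes from the text and the 10 kHz sampling frequency from the notes to Table 4-13, while the open fraction of 0.6, the choice to start each period with the open phase, and all function names are assumptions made for the example.

```python
def square_wave_egg(fs_hz, f0_hz, open_fraction, n_periods):
    """Square wave: 1.0 during the open phase, 0.0 during the closed phase.

    Assumption: each period starts with the open phase, and open_fraction
    (the share of the period that is open) is a free illustrative parameter.
    """
    period = int(round(fs_hz / f0_hz))           # samples per pitch period
    n_open = int(round(open_fraction * period))  # samples in the open phase
    one_period = [1.0] * n_open + [0.0] * (period - n_open)
    return one_period * n_periods

def first_difference(signal):
    """Discrete derivative d[n] = s[n] - s[n-1], with d[0] = s[0]."""
    return [signal[0]] + [b - a for a, b in zip(signal, signal[1:])]

def glottal_instants(diff):
    """Sample indices of the maxima (openings) and minima (closures)."""
    openings = [i for i, d in enumerate(diff) if d > 0]
    closures = [i for i, d in enumerate(diff) if d < 0]
    return openings, closures

egg = square_wave_egg(fs_hz=10000, f0_hz=100, open_fraction=0.6, n_periods=3)
openings, closures = glottal_instants(first_difference(egg))
```

For these settings the differentiated wave has positive spikes at samples 0, 100, and 200 and negative spikes at samples 60, 160, and 260, i.e., one opening and one closing instant per 10 ms pitch period.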

The synthesized data, used as the input to the
program, are shown in Figure 1. Due to the steep spectral
roll-off of the synthesized speech data, adaptive
pre-emphasis was used in the LP analysis. The results of the
analysis are shown in Figure 2. The voicing detection
algorithm does not consider the last frame in a voiced
portion of the input sentence as voiced, as can be seen from
Figure 2(b) and (e). Note that the inverse filtered
differentiated glottal flow waveforms, shown in Figure 2(a),
were normalized to have unit power. We can be confident from
the results that the glottal inverse filtering program works
properly.

Figure 1. Synthesized data. (a) Source pulses; (b)
Synthesized sentence; (c) Waveform to represent EGG.

Figure 2. Results of analysis. (a) Inverse filtered
differentiated source; (b) LF-modeled differentiated
source; (c) Original source; (d) Inverse filtered
source; (e) Modeled source obtained from analyzing the
synthesized signal. Axes: amplitude versus time [ms].

APPENDIX  B 


A DISTORTION  MEASURE  EXPLOITING 
DYNAMIC  SPECTRAL  CONTINUITY 

Similarity (or dissimilarity) measures between the
spectral sequences of the input speech and the stored
reference templates are usually based on a single average or
cumulative figure. It is known that incorporating dynamic
spectral features can improve the performance of speech
processing systems. The importance of treating the
distortion as a process or sequence (instead of just an
average distortion) was emphasized theoretically by Juang
(1984b) and found experimentally by Childers et al. (1989).
The underlying hypothesis here is that supra-segmental
information carries relatively more speaker-independent
information than speaker-dependent information. Some
researchers have reported that dynamic (transitional)
spectral features give more speaker-independent information
and are less susceptible to a transmission channel mismatch
[Aikawa and Furui, 1988; Soong and Rosenberg, 1988]. Thus a
combined similarity measure (temporal and dynamic) generally
helps to reduce speaker variability.

We propose a new spectral transitive function (STF)
based on Juang's proposal. Juang (1984b) proposed a
spectral transitive function of the form

\[
\Phi(m) = \sum_{n=0}^{\infty} e^{-n/\tau} \, D_{IS}\bigl(1/A_m(z),\, 1/A_{m-n}(z)\bigr)
\tag{B-1}
\]

where $m$ is the time index, $\tau$ is a time constant
accounting for short-time auditory memory, and
$D_{IS}(1/A_m(z), 1/A_{m-n}(z))$ denotes the Itakura-Saito
distortion measure between $A_m(z)$ and $A_{m-n}(z)$, the
optimal $M$-th order inverse filters of the speech spectra
$X_m(z)$ and $X_{m-n}(z)$, respectively. The spectral
transitive function $\Phi(m)$ measures the all-pole spectral
change in the speech signal, i.e., it measures the
distortion resulting from replacing the current spectral
envelope with previous spectral envelopes. The spectral
transitive function is to be regarded as part of the speech
signal. The distortion, or noise, in the transitive function
thus provides a measure of the time-variant spectral
distortion that affects the spectral continuity in the
original signal.

We define a new transitive function in the cepstral
domain based on equation (B-1). If we replace the
Itakura-Saito measure with a cepstral distortion measure in
the spectral transitive function, then the transitive
function can be expressed as

\[
\Phi(m) = \sum_{n=0}^{\infty} e^{-n/\tau} \sum_{k=1}^{L} 2\,[c_k(m) - c_k(m-n)]^2
\tag{B-2}
\]

where $c_k(m)$ and $c_k(m-n)$, for $k = 1, 2, \ldots, L$,
are the $k$-th cepstral coefficients of the all-pole models
$1/A_m(z)$ and $1/A_{m-n}(z)$, respectively. Considering
only the change of spectral shape in time, we can use the
energy-normalized all-pole spectra in equation (B-2), which
is the proposed distortion measure. Equation (B-2) can be
rewritten as

\[
\Phi(m) = 2 \sum_{k=1}^{L} \phi_k(m)
\tag{B-3}
\]

where

\[
\phi_k(m) = \sum_{n=0}^{\infty} e^{-n/\tau} \, [c_k(m) - c_k(m-n)]^2
\tag{B-4}
\]

and $\{\phi_k(m),\ k = 1, 2, \ldots, L\}$ are the transitive
functions of the cepstral coefficients.
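
Equations (B-3) and (B-4) can be evaluated frame by frame, as in the following minimal sketch. It assumes the exponential memory weight has the form exp(-n/tau), truncates the infinite sum at a hypothetical `n_max` lags, and takes the cepstra as an array of shape (frames, L) that excludes the energy term; `tau`, `n_max`, and the function name are illustrative choices, not values from the text.

```python
import numpy as np

def cepstral_transitive(cep, tau=3.0, n_max=20):
    """Truncated evaluation of (B-4) and (B-3).

    cep   : array of shape (frames, L) holding c_1..c_L per frame
    tau   : assumed auditory-memory time constant, in frames
    n_max : assumed truncation point of the infinite sum
    Returns phi with shape (frames, L) and Phi with shape (frames,).
    """
    cep = np.asarray(cep, dtype=float)
    n_frames = cep.shape[0]
    phi = np.zeros_like(cep)
    # The n = 0 term is identically zero, so the loop starts at n = 1.
    for n in range(1, min(n_max, n_frames - 1) + 1):
        diff = cep[n:] - cep[:-n]                 # c_k(m) - c_k(m-n), m >= n
        phi[n:] += np.exp(-n / tau) * diff ** 2   # frames with no past are skipped
    Phi = 2.0 * phi.sum(axis=1)                   # equation (B-3)
    return phi, Phi
```

For a stationary segment every difference vanishes and the function returns zero; after an abrupt spectral change the value decays as the change recedes into the weighted past.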

It is worth considering the qualitative differences
between the proposed spectral transitive function and the
one used by Aikawa and Furui (1988), as well as Soong and
Rosenberg (1988). For convenience, let $\phi_k(m)$
represent the proposed transitive function of each cepstral
coefficient as in equation (B-4), and let $\Delta c_k(m)$
represent the one proposed by others:

\[
\Delta c_k(m) = c_k(m) - c_k(m-1)
\tag{B-5}
\]

i.e., $\Delta c_k(m)$ is simply the time-differentiated
$k$-th cepstral coefficient. First, while $\Delta c_k(m)$ is
based only on the log spectral variation in time,
$\phi_k(m)$ considers the auditory memory effect as well as
the variance of the log spectra in time. Thus $\phi_k(m)$
may give, in principle, more reliable spectral transitive
information than $\Delta c_k(m)$. Second, because
$\Delta c_k(m)$ is approximated by a finite difference of
log spectra, it is intrinsically noisy [Soong and Rosenberg,
1988]. Soong and Rosenberg used the coefficients of a
first-order polynomial fit to approximate the
time-differentiated cepstral coefficients. $\phi_k(m)$ is
also approximated by finite differences of log spectra, but
unlike $\Delta c_k(m)$, it is smoothed by the auditory
memory function and hence is less noisy than
$\Delta c_k(m)$. It therefore may not need additional
smoothing.
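
For comparison, the two alternatives discussed above can be sketched side by side: the plain finite difference of equation (B-5), and a least-squares straight-line slope over a short window as one way to realize a first-order polynomial fit. The window half-width and the edge padding are assumptions for this illustration and are not taken from Soong and Rosenberg (1988).

```python
import numpy as np

def delta_cepstrum(cep):
    """Equation (B-5): c_k(m) - c_k(m-1); the first frame is set to zero."""
    cep = np.asarray(cep, dtype=float)
    delta = np.zeros_like(cep)
    delta[1:] = cep[1:] - cep[:-1]
    return delta

def regression_slope(cep, half_width=2):
    """Least-squares slope of a line fitted over 2*half_width + 1 frames.

    The window length and the edge padding are assumed choices.
    """
    cep = np.asarray(cep, dtype=float)
    K = half_width
    n_frames = cep.shape[0]
    padded = np.pad(cep, ((K, K), (0, 0)), mode="edge")  # repeat end frames
    lags = np.arange(-K, K + 1)
    slope = np.zeros_like(cep)
    for j, lag in enumerate(lags):
        slope += lag * padded[j:j + n_frames]   # adds lag * c_k(m + lag)
    return slope / np.sum(lags ** 2)
```

On a linear cepstral trajectory both estimates agree away from the edges; on noisy trajectories the fitted slope averages over the window, which is the smoothing that the finite difference lacks.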